alexmarques committed · Commit 9db6306 · verified · 1 Parent(s): d0c9cb9

Update README.md

Files changed (1): README.md +178 -20
README.md CHANGED
@@ -31,8 +31,9 @@ base_model: meta-llama/Meta-Llama-3.1-405B-Instruct
31
  - **License(s):** Llama3.1
32
  - **Model Developers:** Neural Magic
33
 
34
- Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
35
- It achieves scores within 1% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande, and TruthfulQA.
 
36
 
37
  ### Model Optimizations
38
 
@@ -128,9 +129,19 @@ model.save_pretrained("Meta-Llama-3.1-405B-Instruct-quantized.w4a16")
128
 
129
  ## Evaluation
130
 
131
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
132
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
133
- This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, and MMLU that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-405B-Instruct-evals).
134
 
135
  **Note:** Results have been updated after Meta modified the chat template.
136
 
@@ -148,12 +159,26 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
148
  <td><strong>Recovery</strong>
149
  </td>
150
  </tr>
151
  <tr>
152
  <td>MMLU (5-shot)
153
  </td>
154
- <td>87.38
155
  </td>
156
- <td>87.22
157
  </td>
158
  <td>99.8%
159
  </td>
@@ -161,9 +186,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
161
  <tr>
162
  <td>ARC Challenge (0-shot)
163
  </td>
164
- <td>94.97
165
  </td>
166
- <td>95.31
167
  </td>
168
  <td>100.4%
169
  </td>
@@ -171,9 +196,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
171
  <tr>
172
  <td>GSM-8K (CoT, 8-shot, strict-match)
173
  </td>
174
- <td>96.44
175
  </td>
176
- <td>96.29
177
  </td>
178
  <td>99.8%
179
  </td>
@@ -181,9 +206,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
181
  <tr>
182
  <td>Hellaswag (10-shot)
183
  </td>
184
- <td>88.33
185
  </td>
186
- <td>88.27
187
  </td>
188
  <td>99.9%
189
  </td>
@@ -191,9 +216,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
191
  <tr>
192
  <td>Winogrande (5-shot)
193
  </td>
194
- <td>87.21
195
  </td>
196
- <td>87.37
197
  </td>
198
  <td>100.2%
199
  </td>
@@ -201,9 +226,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
201
  <tr>
202
  <td>TruthfulQA (0-shot)
203
  </td>
204
- <td>64.64
205
  </td>
206
- <td>65.26
207
  </td>
208
  <td>101.0%
209
  </td>
@@ -211,13 +236,111 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
211
  <tr>
212
  <td><strong>Average</strong>
213
  </td>
214
- <td><strong>86.75</strong>
215
  </td>
216
- <td><strong>86.76</strong>
217
  </td>
218
  <td><strong>100.0%</strong>
219
  </td>
220
  </tr>
221
  </table>
222
 
223
  ### Reproduction
@@ -287,4 +410,39 @@ lm_eval \
287
  --tasks truthfulqa \
288
  --num_fewshot 0 \
289
  --batch_size auto
290
- ```
31
  - **License(s):** Llama3.1
32
  - **Model Developers:** Neural Magic
33
 
34
+ This model is a quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
35
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice question answering, math reasoning, and open-ended text generation.
36
+ Meta-Llama-3.1-405B-Instruct-quantized.w4a16 achieves 98.7% recovery for the Arena-Hard evaluation, 100.0% for OpenLLM v1 (using Meta's prompting when available), 99.0% for OpenLLM v2, 98.0% for HumanEval pass@1, and 98.5% for HumanEval+ pass@1.
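Here, "recovery" is the quantized model's score expressed as a percentage of the unquantized model's score, consistent with the tables below. A minimal sketch, using the Arena-Hard row of the evaluation table as example values:

```
# Recovery: quantized score as a percentage of the unquantized baseline.
# Example values are the Arena-Hard scores reported in the evaluation table.
baseline_score = 67.4   # Meta-Llama-3.1-405B-Instruct
quantized_score = 66.5  # Meta-Llama-3.1-405B-Instruct-quantized.w4a16

recovery = 100 * quantized_score / baseline_score
print(f"{recovery:.1f}%")  # ~98.7%
```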
37
 
38
  ### Model Optimizations
39
 
 
129
 
130
  ## Evaluation
131
 
132
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
133
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
134
+
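As a rough illustration of generating an output with the vLLM Python API, the sketch below is not the exact harness setup used for these benchmarks; the sampling parameters and example prompt are assumptions, while the tensor-parallel and context-length settings mirror the reproduction commands further down.

```
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"

# Settings mirror the reproduction commands; adjust to the available hardware.
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=8, max_model_len=4096)

# Build a chat-formatted prompt and generate a single answer.
messages = [{"role": "user", "content": "What does W4A16 quantization mean?"}]  # example prompt (assumption)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
print(outputs[0].outputs[0].text)
```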
135
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
136
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
137
+ Below we report the scores from each judgement along with their average.
138
+
139
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
140
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge, and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-405B-Instruct-evals), as well as a few fixes to OpenLLM v2 tasks.
141
+
142
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
143
+
144
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
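These output datasets can be pulled down with the `datasets` library for inspection; the snippet below is a minimal sketch, and the default configuration/split is an assumption (check each dataset card for the exact schema).

```
from datasets import load_dataset

# Load the published Arena-Hard generations (default configuration assumed).
ds = load_dataset("neuralmagic/quantized-llama-3.1-arena-hard-evals")
print(ds)
```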
145
 
146
  **Note:** Results have been updated after Meta modified the chat template.
147
 
 
159
  <td><strong>Recovery</strong>
160
  </td>
161
  </tr>
162
+ <tr>
163
+ <td><strong>Arena Hard</strong>
164
+ </td>
165
+ <td>67.4 (67.3 / 67.5)
166
+ </td>
167
+ <td>66.5 (66.5 / 66.4)
168
+ </td>
169
+ <td>98.7%
170
+ </td>
171
+ </tr>
172
+ <tr>
173
+ <td><strong>OpenLLM v1</strong>
174
+ </td>
175
+ </tr>
176
  <tr>
177
  <td>MMLU (5-shot)
178
  </td>
179
+ <td>87.4
180
  </td>
181
+ <td>87.2
182
  </td>
183
  <td>99.8%
184
  </td>
 
186
  <tr>
187
  <td>ARC Challenge (0-shot)
188
  </td>
189
+ <td>95.0
190
  </td>
191
+ <td>95.3
192
  </td>
193
  <td>100.4%
194
  </td>
 
196
  <tr>
197
  <td>GSM-8K (CoT, 8-shot, strict-match)
198
  </td>
199
+ <td>96.4
200
  </td>
201
+ <td>96.3
202
  </td>
203
  <td>99.8%
204
  </td>
 
206
  <tr>
207
  <td>Hellaswag (10-shot)
208
  </td>
209
+ <td>88.3
210
  </td>
211
+ <td>88.3
212
  </td>
213
  <td>99.9%
214
  </td>
 
216
  <tr>
217
  <td>Winogrande (5-shot)
218
  </td>
219
+ <td>87.2
220
  </td>
221
+ <td>87.4
222
  </td>
223
  <td>100.2%
224
  </td>
 
226
  <tr>
227
  <td>TruthfulQA (0-shot)
228
  </td>
229
+ <td>64.6
230
  </td>
231
+ <td>65.3
232
  </td>
233
  <td>101.0%
234
  </td>
 
236
  <tr>
237
  <td><strong>Average</strong>
238
  </td>
239
+ <td><strong>86.8</strong>
240
  </td>
241
+ <td><strong>86.8</strong>
242
  </td>
243
  <td><strong>100.0%</strong>
244
  </td>
245
  </tr>
246
+ <tr>
247
+ <td><strong>OpenLLM v2</strong>
248
+ </td>
249
+ </tr>
250
+ <tr>
251
+ <td>MMLU-Pro (5-shot)
252
+ </td>
253
+ <td>59.7
254
+ </td>
255
+ <td>59.4
256
+ </td>
257
+ <td>99.3%
258
+ </td>
259
+ </tr>
260
+ <tr>
261
+ <td>IFEval (0-shot)
262
+ </td>
263
+ <td>87.7
264
+ </td>
265
+ <td>88.0
266
+ </td>
267
+ <td>100.4%
268
+ </td>
269
+ </tr>
270
+ <tr>
271
+ <td>BBH (3-shot)
272
+ </td>
273
+ <td>67.0
274
+ </td>
275
+ <td>67.5
276
+ </td>
277
+ <td>100.7%
278
+ </td>
279
+ </tr>
280
+ <tr>
281
+ <td>Math-lvl-5 (4-shot)
282
+ </td>
283
+ <td>39.0
284
+ </td>
285
+ <td>37.6
286
+ </td>
287
+ <td>96.5%
288
+ </td>
289
+ </tr>
290
+ <tr>
291
+ <td>GPQA (0-shot)
292
+ </td>
293
+ <td>19.5
294
+ </td>
295
+ <td>17.5
296
+ </td>
297
+ <td>89.8%
298
+ </td>
299
+ </tr>
300
+ <tr>
301
+ <td>MuSR (0-shot)
302
+ </td>
303
+ <td>19.5
304
+ </td>
305
+ <td>19.4
306
+ </td>
307
+ <td>99.5%
308
+ </td>
309
+ </tr>
310
+ <tr>
311
+ <td><strong>Average</strong>
312
+ </td>
313
+ <td><strong>48.7</strong>
314
+ </td>
315
+ <td><strong>48.2</strong>
316
+ </td>
317
+ <td><strong>99.0%</strong>
318
+ </td>
319
+ </tr>
320
+ <tr>
321
+ <td><strong>Coding</strong>
322
+ </td>
323
+ </tr>
324
+ <tr>
325
+ <td>HumanEval pass@1
326
+ </td>
327
+ <td>86.8
328
+ </td>
329
+ <td>85.1
330
+ </td>
331
+ <td>98.0%
332
+ </td>
333
+ </tr>
334
+ <tr>
335
+ <td>HumanEval+ pass@1
336
+ </td>
337
+ <td>80.1
338
+ </td>
339
+ <td>78.9
340
+ </td>
341
+ <td>98.5%
342
+ </td>
343
+ </tr>
344
  </table>
345
 
346
  ### Reproduction
 
410
  --tasks truthfulqa \
411
  --num_fewshot 0 \
412
  --batch_size auto
413
+ ```
414
+
415
+ #### OpenLLM v2
416
+ ```
417
+ lm_eval \
418
+ --model vllm \
419
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4096,tensor_parallel_size=8,enable_chunked_prefill=True \
420
+ --apply_chat_template \
421
+ --fewshot_as_multiturn \
422
+ --tasks leaderboard \
423
+ --batch_size auto
424
+ ```
425
+
426
+ #### HumanEval and HumanEval+
427
+ ##### Generation
428
+ ```
429
+ python3 codegen/generate.py \
430
+ --model neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 \
431
+ --bs 16 \
432
+ --temperature 0.2 \
433
+ --n_samples 50 \
434
+ --root "." \
435
+ --dataset humaneval \
436
+ --tp 8
437
+ ```
438
+ ##### Sanitization
439
+ ```
440
+ python3 evalplus/sanitize.py \
441
+ humaneval/neuralmagic--Meta-Llama-3.1-405B-Instruct-quantized.w4a16_vllm_temp_0.2
442
+ ```
443
+ ##### Evaluation
444
+ ```
445
+ evalplus.evaluate \
446
+ --dataset humaneval \
447
+ --samples humaneval/neuralmagic--Meta-Llama-3.1-405B-Instruct-quantized.w4a16_vllm_temp_0.2-sanitized
448
+ ```
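
For reference, pass@1 over the 50 generated samples per problem can be estimated with the standard unbiased pass@k estimator from the original HumanEval setup; the sketch below is illustrative and is not the exact EvalPlus implementation.

```
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 50 samples for one problem, 43 pass the tests; for k=1 this equals c/n.
print(pass_at_k(n=50, c=43, k=1))  # 0.86
```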