Merge branch 'main' of hf.co:Qwen/Qwen2-72B-Instruct
README.md
CHANGED
@@ -130,6 +130,33 @@ Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).

**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

+## Evaluation
+
+We briefly compare Qwen2-72B-Instruct with similar-sized instruction-tuned LLMs, including our previous Qwen1.5-72B-Chat. The results are shown as follows:
+
+| Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | **Qwen2-72B-Instruct** |
+| :--- | :---: | :---: | :---: |
+| _**English**_ | | | |
+| MMLU | 82.0 | 75.6 | **82.3** |
+| MMLU-Pro | 56.2 | 51.7 | **64.4** |
+| GPQA | 41.9 | 39.4 | **42.4** |
+| TheoremQA | 42.5 | 28.8 | **44.4** |
+| MT-Bench | 8.95 | 8.61 | **9.12** |
+| Arena-Hard | 41.1 | 36.1 | **48.1** |
+| IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | **77.6** |
+| _**Coding**_ | | | |
+| HumanEval | 81.7 | 71.3 | **86.0** |
+| MBPP | **82.3** | 71.9 | 80.2 |
+| MultiPL-E | 63.4 | 48.1 | **69.2** |
+| EvalPlus | 75.2 | 66.9 | **79.0** |
+| LiveCodeBench | 29.3 | 17.9 | **35.7** |
+| _**Mathematics**_ | | | |
+| GSM8K | **93.0** | 82.7 | 91.1 |
+| MATH | 50.4 | 42.5 | **59.7** |
+| _**Chinese**_ | | | |
+| C-Eval | 61.6 | 76.1 | **83.8** |
+| AlignBench | 7.42 | 7.28 | **8.27** |
+
## Citation

If you find our work helpful, feel free to give us a cite.
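For reference on the `rope_scaling` note above: the deployment section of this README (outside this hunk) enables YARN for long-context serving by adding a `rope_scaling` entry to the model's `config.json` before launching vLLM. The sketch below is one way to patch a local copy of the config; the checkpoint path and the YARN values are illustrative assumptions, not taken from this diff.

```python
import json

# Hypothetical path to a local checkout of the model; adjust as needed.
cfg_path = "Qwen2-72B-Instruct/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Illustrative YARN settings; the actual factor and original context length
# should follow the model card's long-context instructions.
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```

Since vLLM's YARN support here is static, this factor is applied to every request regardless of prompt length, which is why the note advises adding the entry only when long inputs are actually expected.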