Update README.md

For deployment, we recommend using vLLM. You can enable the long-context capabilities by adding the `rope_scaling` configuration.
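
As a reference point, here is a minimal sketch of offline inference through vLLM's Python API. The `tensor_parallel_size` and sampling settings are illustrative assumptions, not recommended values from this README.

```python
# Minimal sketch: offline inference with vLLM's Python API.
# Assumes vLLM is installed and enough GPU memory is available;
# tensor_parallel_size and the sampling settings are illustrative only.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/Qwen2-72B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=8)

# Build a chat-formatted prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
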
**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
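
When long-context processing is required, one common way to add the configuration is to edit the checkpoint's `config.json` before launching vLLM. The sketch below is illustrative; the local path and the YARN values are assumptions rather than settings stated in this README.

```python
import json

# Illustrative sketch: add a static YARN `rope_scaling` entry to a local copy of
# the model's config.json. The path and scaling values are assumptions for
# demonstration, not settings stated in this README.
config_path = "Qwen2-72B-Instruct/config.json"

with open(config_path) as f:
    config = json.load(f)

config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,                              # constant scaling factor (static YARN)
    "original_max_position_embeddings": 32768,  # native context length before scaling
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

Because the factor is static, it also applies to short inputs, which is why the note above advises adding it only when long contexts are actually needed.
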
## Evaluation

We briefly compare Qwen2-72B-Instruct with similar-sized instruction-tuned LLMs. The results are shown as follows:

| Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | **Qwen2-72B-Instruct** |
| :--- | :---: | :---: | :---: |
| _**English**_ |  |  |  |
| MMLU | 82.0 | 75.6 | **82.3** |
| MMLU-Pro | 56.2 | 51.7 | **64.4** |
| GPQA | 41.9 | 39.4 | **42.4** |
| TheoremQA | 42.5 | 28.8 | **44.4** |
| MT-Bench | 8.95 | 8.61 | **9.12** |
| Arena-Hard | 41.1 | 36.1 | **48.1** |
| IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | **77.6** |
| _**Coding**_ |  |  |  |
| HumanEval | 81.7 | 71.3 | **86.0** |
| MBPP | **82.3** | 71.9 | 80.2 |
| MultiPL-E | 63.4 | 48.1 | **69.2** |
| EvalPlus | 75.2 | 66.9 | **79.0** |
| LiveCodeBench | 29.3 | 17.9 | **35.7** |
| _**Mathematics**_ |  |  |  |
| GSM8K | **93.0** | 82.7 | 91.1 |
| MATH | 50.4 | 42.5 | **59.7** |
| _**Chinese**_ |  |  |  |
| C-Eval | 61.6 | 76.1 | **83.8** |
| AlignBench | 7.42 | 7.28 | **8.27** |

## Citation
If you find our work helpful, feel free to give us a cite.