For deployment, we recommend using vLLM. You can enable the long-context capabilities by adding a `rope_scaling` entry to the model's `config.json`.

**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
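
If it helps to see what that configuration might look like, below is a minimal sketch that patches `config.json` for a locally downloaded checkpoint. The path and the YARN values (`factor`, `original_max_position_embeddings`) are illustrative assumptions rather than values taken from this README; a factor of 4.0 would stretch a 32,768-token native context to roughly 131,072 tokens, so check the model card for the numbers that apply to your checkpoint, or simply edit the file by hand.

```python
import json
from pathlib import Path

# Illustrative values: a YARN factor of 4.0 stretches a 32,768-token native
# context to ~131,072 tokens. Verify both numbers against the model card
# before deploying.
config_path = Path("Qwen2-7B-Instruct/config.json")  # adjust to your local checkpoint path

config = json.loads(config_path.read_text())
config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
config_path.write_text(json.dumps(config, indent=2))
```

Because the scaling is static, removing the `rope_scaling` entry restores the default behavior, which is preferable for workloads that stay within the native context length.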
## Evaluation

We briefly compare Qwen2-7B-Instruct with similar-sized instruction-tuned LLMs, including Qwen1.5-7B-Chat. The results are shown below:

| Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
| :--- | :---: | :---: | :---: | :---: | :---: |
| _**English**_ | | | | | |
| MMLU | 68.4 | 69.5 | **72.4** | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | **44.1** |
| GPQA | **34.2** | - | - | 27.8 | 25.3 |
| TheoremQA | 23.0 | - | - | 14.1 | **25.3** |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | **8.41** |
| _**Coding**_ | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | **79.9** |
| MBPP | **67.9** | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | **59.1** |
| EvalPlus | 60.9 | - | - | 44.8 | **70.3** |
| LiveCodeBench | 17.3 | - | - | 6.0 | **26.6** |
| _**Mathematics**_ | | | | | |
| GSM8K | 79.6 | **84.8** | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | **50.6** | 23.2 | 49.6 |
| _**Chinese**_ | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | **77.2** |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | **7.21** |

## Citation

If you find our work helpful, feel free to give us a cite.