Dongfu Jiang committed
Commit · c1046ee · Parent(s): 5e3cb95
Update README.md

README.md CHANGED
@@ -224,8 +224,9 @@ We test the pairwise comparison on
 | Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
 | WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
 | LLAMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
-| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
-
+| AUTO-J (13b) | 45.8 | 38.9 | **59.2** | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
+| UltraRM (13b) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | **64.17** | 56.25 | 59.85 | **59.85** |
+| **PairRM (0.4b)** | **56.94** | **52.78** | 58.33 | **55.83** | **61.57** | 59.17 | 57.64 | **62.5** | 59.05 |
 
 #### HHH-Alignment and MT-bench human judgements
 
@@ -242,7 +243,8 @@ We test the pairwise comparison on
 | GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
 | PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
 | PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
-
+| UltraRM (13B) | **86.44** | 79.31 | **81.97** | 88.37 | 83.71 | 56 |
+| **PairRM (0.4B)** | 84.75 | 84.48 | 80.33 | **90.7** | **84.62** | **59** |
 | GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
 
 **While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
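For context on what the pairwise-comparison agreement numbers above measure (how often the model's preferred candidate matches the reference preference), here is a minimal sketch of running PairRM as a pairwise judge through the llm-blender package. It assumes the `Blender.loadranker` / `Blender.compare` interface described in the PairRM model card; treat the exact example inputs and argument names as illustrative.

```python
# Minimal sketch: pairwise comparison with PairRM via llm-blender.
# Assumes `pip install llm-blender` and the Blender.loadranker / Blender.compare
# interface from the PairRM model card; inputs here are illustrative only.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load the 0.4B PairRM ranker checkpoint

inputs = ["Explain why the sky is blue."]
candidates_A = ["Rayleigh scattering makes shorter (blue) wavelengths scatter more."]
candidates_B = ["The sky reflects the color of the ocean."]

# compare() returns, per input, whether candidate A is judged better than candidate B.
results = blender.compare(inputs, candidates_A, candidates_B)
print(results)  # e.g. [True] if A is preferred
```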