mistralai/Mistral-7B-Instruct-v0.2

|                 Tasks                 |Version|     Filter     |n-shot|  Metric   | Value |   |Stderr|                                                                    
|---------------------------------------|-------|----------------|-----:|-----------|------:|---|-----:|                                                                              
|winogrande                             |      1|none            |     0|acc        | 0.7364|±  |0.0124|                                                                              
|truthfulqa                             |N/A    |none            |     0|acc        | 0.5954|±  |0.0116|                                                                              
|                                       |       |none            |     0|rouge1_max |46.4534|±  |0.8502|                                                                              
|                                       |       |none            |     0|rougeL_diff| 5.5378|±  |0.8859|                                                                              
|                                       |       |none            |     0|bleu_acc   | 0.5483|±  |0.0174|                                                                              
|                                       |       |none            |     0|rouge2_max |31.1969|±  |0.9785|                                                                              
|                                       |       |none            |     0|rougeL_max |43.4263|±  |0.8666|                                                                              
|                                       |       |none            |     0|rougeL_acc | 0.5606|±  |0.0174|                                                                              
|                                       |       |none            |     0|rouge2_acc | 0.4529|±  |0.0174|                                                                              
|                                       |       |none            |     0|rouge2_diff| 5.3591|±  |0.9416|                                                                              
|                                       |       |none            |     0|bleu_max   |21.2977|±  |0.7504|                                                                              
|                                       |       |none            |     0|rouge1_acc | 0.5741|±  |0.0173|                                                                              
|                                       |       |none            |     0|bleu_diff  | 4.3215|±  |0.6161|                                                                              
|                                       |       |none            |     0|rouge1_diff| 5.7381|±  |0.8786|                                                                              
| - truthfulqa_gen                      |      3|none            |     0|bleu_max   |21.2977|±  |0.7504|                                                                              
|                                       |       |none            |     0|bleu_acc   | 0.5483|±  |0.0174|                                                                              
|                                       |       |none            |     0|bleu_diff  | 4.3215|±  |0.6161|                                                                              
|                                       |       |none            |     0|rouge1_max |46.4534|±  |0.8502|                                                                              
|                                       |       |none            |     0|rouge1_acc | 0.5741|±  |0.0173|                                                                              
|                                       |       |none            |     0|rouge1_diff| 5.7381|±  |0.8786|                                                                              
|                                       |       |none            |     0|rouge2_max |31.1969|±  |0.9785|                                                                              
|                                       |       |none            |     0|rouge2_acc | 0.4529|±  |0.0174|                                                                              
|                                       |       |none            |     0|rouge2_diff| 5.3591|±  |0.9416|                                                                              
|                                       |       |none            |     0|rougeL_max |43.4263|±  |0.8666|                                                                              
|                                       |       |none            |     0|rougeL_acc | 0.5606|±  |0.0174|                                                                              
|                                       |       |none            |     0|rougeL_diff| 5.5378|±  |0.8859|                                                                              
| - truthfulqa_mc1                      |      2|none            |     0|acc        | 0.5226|±  |0.0175|                                                                              
| - truthfulqa_mc2                      |      2|none            |     0|acc        | 0.6681|±  |0.0153|                                                                              
|piqa                                   |      1|none            |     0|acc        | 0.8003|±  |0.0093|                                                                              
|                                       |       |none            |     0|acc_norm   | 0.8047|±  |0.0092|                                                                              
|openbookqa                             |      1|none            |     0|acc        | 0.3600|±  |0.0215|                                                                              
|                                       |       |none            |     0|acc_norm   | 0.4520|±  |0.0223|
|mmlu                                   |N/A    |none            |     0|acc        | 0.5879|±  |0.0039|
| - humanities                          |N/A    |none            |     0|acc        | 0.5396|±  |0.0069|
|  - formal_logic                       |      0|none            |     0|acc        | 0.3651|±  |0.0431|
|  - high_school_european_history       |      0|none            |     0|acc        | 0.7273|±  |0.0348|
|  - high_school_us_history             |      0|none            |     0|acc        | 0.7794|±  |0.0291|
|  - high_school_world_history          |      0|none            |     0|acc        | 0.7764|±  |0.0271|
|  - international_law                  |      0|none            |     0|acc        | 0.7438|±  |0.0398|
|  - jurisprudence                      |      0|none            |     0|acc        | 0.7130|±  |0.0437|
|  - logical_fallacies                  |      0|none            |     0|acc        | 0.7546|±  |0.0338|
|  - moral_disputes                     |      0|none            |     0|acc        | 0.6532|±  |0.0256|
|  - moral_scenarios                    |      0|none            |     0|acc        | 0.3564|±  |0.0160|
|  - philosophy                         |      0|none            |     0|acc        | 0.6463|±  |0.0272|
|  - prehistory                         |      0|none            |     0|acc        | 0.6821|±  |0.0259|
|  - professional_law                   |      0|none            |     0|acc        | 0.4133|±  |0.0126|
|  - world_religions                    |      0|none            |     0|acc        | 0.8129|±  |0.0299|
| - other                               |N/A    |none            |     0|acc        | 0.6621|±  |0.0082|
|  - business_ethics                    |      0|none            |     0|acc        | 0.5900|±  |0.0494|
|  - clinical_knowledge                 |      0|none            |     0|acc        | 0.6491|±  |0.0294|
|  - college_medicine                   |      0|none            |     0|acc        | 0.5549|±  |0.0379|
|  - global_facts                       |      0|none            |     0|acc        | 0.3800|±  |0.0488|
|  - human_aging                        |      0|none            |     0|acc        | 0.6233|±  |0.0325|
|  - management                         |      0|none            |     0|acc        | 0.7184|±  |0.0445|
|  - marketing                          |      0|none            |     0|acc        | 0.8761|±  |0.0216|
|  - medical_genetics                   |      0|none            |     0|acc        | 0.6500|±  |0.0479|
|  - miscellaneous                      |      0|none            |     0|acc        | 0.7944|±  |0.0145|
|  - nutrition                          |      0|none            |     0|acc        | 0.6732|±  |0.0269|
|  - professional_accounting            |      0|none            |     0|acc        | 0.4468|±  |0.0297|
|  - professional_medicine              |      0|none            |     0|acc        | 0.6581|±  |0.0288|
|  - virology                           |      0|none            |     0|acc        | 0.4578|±  |0.0388|
| - social_sciences                     |N/A    |none            |     0|acc        | 0.6799|±  |0.0082|
|  - econometrics                       |      0|none            |     0|acc        | 0.4649|±  |0.0469|
|  - high_school_geography              |      0|none            |     0|acc        | 0.7374|±  |0.0314|
|  - high_school_government_and_politics|      0|none            |     0|acc        | 0.8031|±  |0.0287|
|  - high_school_macroeconomics         |      0|none            |     0|acc        | 0.5590|±  |0.0252|
|  - high_school_microeconomics         |      0|none            |     0|acc        | 0.6387|±  |0.0312|
|  - high_school_psychology             |      0|none            |     0|acc        | 0.7853|±  |0.0176|
|  - human_sexuality                    |      0|none            |     0|acc        | 0.6794|±  |0.0409|
|  - professional_psychology            |      0|none            |     0|acc        | 0.5866|±  |0.0199|
|  - public_relations                   |      0|none            |     0|acc        | 0.6455|±  |0.0458|
|  - security_studies                   |      0|none            |     0|acc        | 0.6816|±  |0.0298|
|  - sociology                          |      0|none            |     0|acc        | 0.8408|±  |0.0259|
|  - us_foreign_policy                  |      0|none            |     0|acc        | 0.8500|±  |0.0359|
| - stem                                |N/A    |none            |     0|acc        | 0.4970|±  |0.0087|
|  - abstract_algebra                   |      0|none            |     0|acc        | 0.3200|±  |0.0469|
|  - anatomy                            |      0|none            |     0|acc        | 0.5704|±  |0.0428|
|  - astronomy                          |      0|none            |     0|acc        | 0.6382|±  |0.0391|
|  - college_biology                    |      0|none            |     0|acc        | 0.6597|±  |0.0396|
|  - college_chemistry                  |      0|none            |     0|acc        | 0.4100|±  |0.0494|
|  - college_computer_science           |      0|none            |     0|acc        | 0.5400|±  |0.0501|
|  - college_mathematics                |      0|none            |     0|acc        | 0.3400|±  |0.0476|                                                                    
|  - college_physics                    |      0|none            |     0|acc        | 0.3725|±  |0.0481|
|  - computer_security                  |      0|none            |     0|acc        | 0.6700|±  |0.0473|
|  - conceptual_physics                 |      0|none            |     0|acc        | 0.4809|±  |0.0327|
|  - electrical_engineering             |      0|none            |     0|acc        | 0.5931|±  |0.0409|
|  - elementary_mathematics             |      0|none            |     0|acc        | 0.4233|±  |0.0254|
|  - high_school_biology                |      0|none            |     0|acc        | 0.6774|±  |0.0266|
|  - high_school_chemistry              |      0|none            |     0|acc        | 0.4877|±  |0.0352|
|  - high_school_computer_science       |      0|none            |     0|acc        | 0.6100|±  |0.0490|
|  - high_school_mathematics            |      0|none            |     0|acc        | 0.3556|±  |0.0292|
|  - high_school_physics                |      0|none            |     0|acc        | 0.3642|±  |0.0393|
|  - high_school_statistics             |      0|none            |     0|acc        | 0.4630|±  |0.0340|
|  - machine_learning                   |      0|none            |     0|acc        | 0.4643|±  |0.0473|
|hellaswag                              |      1|none            |     0|acc        | 0.6608|±  |0.0047|
|                                       |       |none            |     0|acc_norm   | 0.8368|±  |0.0037|
|gsm8k                                  |      3|strict-match    |     5|exact_match| 0.4155|±  |0.0136|
|                                       |       |flexible-extract|     5|exact_match| 0.4193|±  |0.0136|
|boolq                                  |      2|none            |     0|acc        | 0.8529|±  |0.0062|
|arc_easy                               |      1|none            |     0|acc        | 0.8136|±  |0.0080|
|                                       |       |none            |     0|acc_norm   | 0.7660|±  |0.0087|
|arc_challenge                          |      1|none            |     0|acc        | 0.5435|±  |0.0146|
|                                       |       |none            |     0|acc_norm   | 0.5580|±  |0.0145|

|      Groups      |Version|Filter|n-shot|  Metric   | Value |   |Stderr|
|------------------|-------|------|-----:|-----------|------:|---|-----:|
|truthfulqa        |N/A    |none  |     0|acc        | 0.5954|±  |0.0116|
|                  |       |none  |     0|rouge1_max |46.4534|±  |0.8502|
|                  |       |none  |     0|rougeL_diff| 5.5378|±  |0.8859|
|                  |       |none  |     0|bleu_acc   | 0.5483|±  |0.0174|
|                  |       |none  |     0|rouge2_max |31.1969|±  |0.9785|
|                  |       |none  |     0|rougeL_max |43.4263|±  |0.8666|
|                  |       |none  |     0|rougeL_acc | 0.5606|±  |0.0174|
|                  |       |none  |     0|rouge2_acc | 0.4529|±  |0.0174|
|                  |       |none  |     0|rouge2_diff| 5.3591|±  |0.9416|
|                  |       |none  |     0|bleu_max   |21.2977|±  |0.7504|
|                  |       |none  |     0|rouge1_acc | 0.5741|±  |0.0173|
|                  |       |none  |     0|bleu_diff  | 4.3215|±  |0.6161|
|                  |       |none  |     0|rouge1_diff| 5.7381|±  |0.8786|
|mmlu              |N/A    |none  |     0|acc        | 0.5879|±  |0.0039|
| - humanities     |N/A    |none  |     0|acc        | 0.5396|±  |0.0069|
| - other          |N/A    |none  |     0|acc        | 0.6621|±  |0.0082|
| - social_sciences|N/A    |none  |     0|acc        | 0.6799|±  |0.0082|
| - stem           |N/A    |none  |     0|acc        | 0.4970|±  |0.0087|
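The tables above match the output format of EleutherAI's lm-evaluation-harness. A command along the following lines would produce a comparable report; the `dtype` and `batch_size` settings here are assumptions, not taken from the original run, and per-task n-shot values (e.g. 5-shot for gsm8k) come from the harness's task defaults:

```shell
# Sketch of a reproduction command (model args and batch size are assumed, not confirmed).
lm_eval --model hf \
  --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=bfloat16 \
  --tasks winogrande,truthfulqa,piqa,openbookqa,mmlu,hellaswag,gsm8k,boolq,arc_easy,arc_challenge \
  --batch_size auto
```

Exact scores will still vary slightly with harness version (note the per-task `Version` column) and hardware/precision choices.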