Evaluation results: raidhon/coven_7b_128k_orpo_alpha

|                 Tasks                 |Version|     Filter     |n-shot|  Metric   | Value |   |Stderr|                                                                              
|---------------------------------------|-------|----------------|-----:|-----------|------:|---|-----:|                                                                              
|winogrande                             |      1|none            |     0|acc        | 0.7782|±  |0.0117|                                                                              
|truthfulqa                             |N/A    |none            |     0|rouge1_max |47.8575|±  |0.8139|                                                                              
|                                       |       |none            |     0|bleu_max   |21.9412|±  |0.7280|                                                                              
|                                       |       |none            |     0|rouge2_max |32.7726|±  |0.9228|                                                                              
|                                       |       |none            |     0|rougeL_diff|-1.4310|±  |0.7806|                                                                              
|                                       |       |none            |     0|acc        | 0.4955|±  |0.0115|                                                                              
|                                       |       |none            |     0|bleu_diff  |-0.2883|±  |0.6228|                                                                              
|                                       |       |none            |     0|rouge2_acc | 0.3807|±  |0.0170|                                                                              
|                                       |       |none            |     0|rougeL_max |44.1785|±  |0.8274|                                                                              
|                                       |       |none            |     0|rougeL_acc | 0.4443|±  |0.0174|                                                                              
|                                       |       |none            |     0|rouge2_diff|-1.5603|±  |0.8950|                                                                              
|                                       |       |none            |     0|bleu_acc   | 0.4321|±  |0.0173|                                                                              
|                                       |       |none            |     0|rouge1_diff|-0.7276|±  |0.7721|                                                                              
|                                       |       |none            |     0|rouge1_acc | 0.4774|±  |0.0175|                                                                              
| - truthfulqa_gen                      |      3|none            |     0|bleu_max   |21.9412|±  |0.7280|                                                                              
|                                       |       |none            |     0|bleu_acc   | 0.4321|±  |0.0173|                                                                              
|                                       |       |none            |     0|bleu_diff  |-0.2883|±  |0.6228|                                                                              
|                                       |       |none            |     0|rouge1_max |47.8575|±  |0.8139|                                                                              
|                                       |       |none            |     0|rouge1_acc | 0.4774|±  |0.0175|                                                                              
|                                       |       |none            |     0|rouge1_diff|-0.7276|±  |0.7721|                                                                              
|                                       |       |none            |     0|rouge2_max |32.7726|±  |0.9228|                                                                              
|                                       |       |none            |     0|rouge2_acc | 0.3807|±  |0.0170|                                                                              
|                                       |       |none            |     0|rouge2_diff|-1.5603|±  |0.8950|                                                                              
|                                       |       |none            |     0|rougeL_max |44.1785|±  |0.8274|                                                                              
|                                       |       |none            |     0|rougeL_acc | 0.4443|±  |0.0174|                                                                              
|                                       |       |none            |     0|rougeL_diff|-1.4310|±  |0.7806|                                                                              
| - truthfulqa_mc1                      |      2|none            |     0|acc        | 0.4174|±  |0.0173|                                                                              
| - truthfulqa_mc2                      |      2|none            |     0|acc        | 0.5736|±  |0.0151|                                                                              
|piqa                                   |      1|none            |     0|acc        | 0.8205|±  |0.0090|                                                                              
|                                       |       |none            |     0|acc_norm   | 0.8395|±  |0.0086|
|openbookqa                             |      1|none            |     0|acc        | 0.3460|±  |0.0213|
|                                       |       |none            |     0|acc_norm   | 0.4800|±  |0.0224|
|mmlu                                   |N/A    |none            |     0|acc        | 0.6300|±  |0.0038|
| - humanities                          |N/A    |none            |     0|acc        | 0.5779|±  |0.0066|
|  - formal_logic                       |      0|none            |     0|acc        | 0.4127|±  |0.0440|
|  - high_school_european_history       |      0|none            |     0|acc        | 0.8061|±  |0.0309|
|  - high_school_us_history             |      0|none            |     0|acc        | 0.8480|±  |0.0252|
|  - high_school_world_history          |      0|none            |     0|acc        | 0.8523|±  |0.0231|
|  - international_law                  |      0|none            |     0|acc        | 0.7934|±  |0.0370|
|  - jurisprudence                      |      0|none            |     0|acc        | 0.7685|±  |0.0408|
|  - logical_fallacies                  |      0|none            |     0|acc        | 0.7730|±  |0.0329|
|  - moral_disputes                     |      0|none            |     0|acc        | 0.7110|±  |0.0244|
|  - moral_scenarios                    |      0|none            |     0|acc        | 0.2894|±  |0.0152|
|  - philosophy                         |      0|none            |     0|acc        | 0.7106|±  |0.0258|
|  - prehistory                         |      0|none            |     0|acc        | 0.7685|±  |0.0235|
|  - professional_law                   |      0|none            |     0|acc        | 0.4824|±  |0.0128|
|  - world_religions                    |      0|none            |     0|acc        | 0.8129|±  |0.0299|
| - other                               |N/A    |none            |     0|acc        | 0.7090|±  |0.0078|
|  - business_ethics                    |      0|none            |     0|acc        | 0.5900|±  |0.0494|
|  - clinical_knowledge                 |      0|none            |     0|acc        | 0.7245|±  |0.0275|
|  - college_medicine                   |      0|none            |     0|acc        | 0.6532|±  |0.0363|
|  - global_facts                       |      0|none            |     0|acc        | 0.3200|±  |0.0469|
|  - human_aging                        |      0|none            |     0|acc        | 0.7040|±  |0.0306|
|  - management                         |      0|none            |     0|acc        | 0.7864|±  |0.0406|
|  - marketing                          |      0|none            |     0|acc        | 0.8632|±  |0.0225|
|  - medical_genetics                   |      0|none            |     0|acc        | 0.7500|±  |0.0435|
|  - miscellaneous                      |      0|none            |     0|acc        | 0.8212|±  |0.0137|
|  - nutrition                          |      0|none            |     0|acc        | 0.7451|±  |0.0250|
|  - professional_accounting            |      0|none            |     0|acc        | 0.5000|±  |0.0298|
|  - professional_medicine              |      0|none            |     0|acc        | 0.7059|±  |0.0277|
|  - virology                           |      0|none            |     0|acc        | 0.5301|±  |0.0389|
| - social_sciences                     |N/A    |none            |     0|acc        | 0.7358|±  |0.0077|
|  - econometrics                       |      0|none            |     0|acc        | 0.4474|±  |0.0468|
|  - high_school_geography              |      0|none            |     0|acc        | 0.7525|±  |0.0307|
|  - high_school_government_and_politics|      0|none            |     0|acc        | 0.9016|±  |0.0215|
|  - high_school_macroeconomics         |      0|none            |     0|acc        | 0.6564|±  |0.0241|
|  - high_school_microeconomics         |      0|none            |     0|acc        | 0.6807|±  |0.0303|
|  - high_school_psychology             |      0|none            |     0|acc        | 0.8404|±  |0.0157|
|  - human_sexuality                    |      0|none            |     0|acc        | 0.7405|±  |0.0384|
|  - professional_psychology            |      0|none            |     0|acc        | 0.6552|±  |0.0192|
|  - public_relations                   |      0|none            |     0|acc        | 0.6727|±  |0.0449|
|  - security_studies                   |      0|none            |     0|acc        | 0.7673|±  |0.0270|
|  - sociology                          |      0|none            |     0|acc        | 0.8358|±  |0.0262|
|  - us_foreign_policy                  |      0|none            |     0|acc        | 0.8600|±  |0.0349|
| - stem                                |N/A    |none            |     0|acc        | 0.5265|±  |0.0085|
|  - abstract_algebra                   |      0|none            |     0|acc        | 0.3100|±  |0.0465|
|  - anatomy                            |      0|none            |     0|acc        | 0.6000|±  |0.0423|
|  - astronomy                          |      0|none            |     0|acc        | 0.6842|±  |0.0378|
|  - college_biology                    |      0|none            |     0|acc        | 0.7292|±  |0.0372|
|  - college_chemistry                  |      0|none            |     0|acc        | 0.4700|±  |0.0502|
|  - college_computer_science           |      0|none            |     0|acc        | 0.5600|±  |0.0499|
|  - college_mathematics                |      0|none            |     0|acc        | 0.3500|±  |0.0479|
|  - college_physics                    |      0|none            |     0|acc        | 0.3529|±  |0.0476|
|  - computer_security                  |      0|none            |     0|acc        | 0.7100|±  |0.0456|
|  - conceptual_physics                 |      0|none            |     0|acc        | 0.5574|±  |0.0325|
|  - electrical_engineering             |      0|none            |     0|acc        | 0.5793|±  |0.0411|
|  - elementary_mathematics             |      0|none            |     0|acc        | 0.4101|±  |0.0253|
|  - high_school_biology                |      0|none            |     0|acc        | 0.7903|±  |0.0232|
|  - high_school_chemistry              |      0|none            |     0|acc        | 0.4828|±  |0.0352|
|  - high_school_computer_science       |      0|none            |     0|acc        | 0.6600|±  |0.0476|
|  - high_school_mathematics            |      0|none            |     0|acc        | 0.3444|±  |0.0290|
|  - high_school_physics                |      0|none            |     0|acc        | 0.3642|±  |0.0393|
|  - high_school_statistics             |      0|none            |     0|acc        | 0.5000|±  |0.0341|
|  - machine_learning                   |      0|none            |     0|acc        | 0.5268|±  |0.0474|
|hellaswag                              |      1|none            |     0|acc        | 0.6537|±  |0.0047|
|                                       |       |none            |     0|acc_norm   | 0.8429|±  |0.0036|
|gsm8k                                  |      3|strict-match    |     5|exact_match| 0.7218|±  |0.0123|
|                                       |       |flexible-extract|     5|exact_match| 0.7263|±  |0.0123|
|boolq                                  |      2|none            |     0|acc        | 0.8743|±  |0.0058|
|arc_easy                               |      1|none            |     0|acc        | 0.8502|±  |0.0073|
|                                       |       |none            |     0|acc_norm   | 0.8295|±  |0.0077|
|arc_challenge                          |      1|none            |     0|acc        | 0.5964|±  |0.0143|
|                                       |       |none            |     0|acc_norm   | 0.6169|±  |0.0142|

|      Groups      |Version|Filter|n-shot|  Metric   | Value |   |Stderr|
|------------------|-------|------|-----:|-----------|------:|---|-----:|
|truthfulqa        |N/A    |none  |     0|rouge1_max |47.8575|±  |0.8139|
|                  |       |none  |     0|bleu_max   |21.9412|±  |0.7280|
|                  |       |none  |     0|rouge2_max |32.7726|±  |0.9228|
|                  |       |none  |     0|rougeL_diff|-1.4310|±  |0.7806|
|                  |       |none  |     0|acc        | 0.4955|±  |0.0115|
|                  |       |none  |     0|bleu_diff  |-0.2883|±  |0.6228|
|                  |       |none  |     0|rouge2_acc | 0.3807|±  |0.0170|
|                  |       |none  |     0|rougeL_max |44.1785|±  |0.8274|
|                  |       |none  |     0|rougeL_acc | 0.4443|±  |0.0174|
|                  |       |none  |     0|rouge2_diff|-1.5603|±  |0.8950|
|                  |       |none  |     0|bleu_acc   | 0.4321|±  |0.0173|
|                  |       |none  |     0|rouge1_diff|-0.7276|±  |0.7721|
|                  |       |none  |     0|rouge1_acc | 0.4774|±  |0.0175|
|mmlu              |N/A    |none  |     0|acc        | 0.6300|±  |0.0038|
| - humanities     |N/A    |none  |     0|acc        | 0.5779|±  |0.0066|
| - other          |N/A    |none  |     0|acc        | 0.7090|±  |0.0078|
| - social_sciences|N/A    |none  |     0|acc        | 0.7358|±  |0.0077|
| - stem           |N/A    |none  |     0|acc        | 0.5265|±  |0.0085|
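
The tables above follow the `Tasks|Version|Filter|n-shot|Metric|Value|Stderr` layout printed by EleutherAI's lm-evaluation-harness. As a hedged sketch of how a comparable run could be reproduced, assuming the v0.4.x Python API (`simple_evaluate`, `make_table`); the task list and n-shot settings are taken from the tables, while dtype and batch size are illustrative assumptions not recorded in this file:

```python
# Sketch: reproduce results tables in this format with lm-evaluation-harness.
# Assumptions (not from the source): bfloat16 weights, batch_size="auto".
import lm_eval
from lm_eval.utils import make_table

# 0-shot tasks, as reported in the per-task table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=raidhon/coven_7b_128k_orpo_alpha,dtype=bfloat16",
    tasks=["winogrande", "truthfulqa", "piqa", "openbookqa", "mmlu",
           "hellaswag", "boolq", "arc_easy", "arc_challenge"],
    num_fewshot=0,
    batch_size="auto",
)
print(make_table(results))             # per-task table
print(make_table(results, "groups"))   # grouped table (truthfulqa, mmlu)

# gsm8k is reported with n-shot = 5, so it would be a separate run.
gsm8k_results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=raidhon/coven_7b_128k_orpo_alpha,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(make_table(gsm8k_results))
```

The `lm_eval` CLI equivalent would pass the same task list via `--tasks` and the few-shot count via `--num_fewshot`. The harness version, generation settings, and hardware used for the original run are not recorded here, so reproduced numbers may differ slightly.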