|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8029|±  |0.0110|
|     |       |strict-match    |     5|exact_match|↑  |0.7961|±  |0.0111|

|     Tasks      |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------|------:|------|-----:|--------|---|-----:|---|------|
|kobest_boolq    |      1|none  |     5|acc     |↑  |0.9167|±  |0.0074|
|                |       |none  |     5|f1      |↑  |0.9167|±  |   N/A|
|kobest_copa     |      1|none  |     5|acc     |↑  |0.7130|±  |0.0143|
|                |       |none  |     5|f1      |↑  |0.7125|±  |   N/A|
|kobest_hellaswag|      1|none  |     5|acc     |↑  |0.4540|±  |0.0223|
|                |       |none  |     5|acc_norm|↑  |0.5700|±  |0.0222|
|                |       |none  |     5|f1      |↑  |0.4505|±  |   N/A|
|kobest_sentineg |      1|none  |     5|acc     |↑  |0.9496|±  |0.0110|
|                |       |none  |     5|f1      |↑  |0.9496|±  |   N/A|
|kobest_wic      |      1|none  |     5|acc     |↑  |0.7111|±  |0.0128|
|                |       |none  |     5|f1      |↑  |0.7025|±  |   N/A|

|                         Tasks                         |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------------------------------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|kmmlu_direct_accounting                                |      2|none  |     5|exact_match|↑  |0.5500|±  |0.0500|
|kmmlu_direct_agricultural_sciences                     |      2|none  |     5|exact_match|↑  |0.3680|±  |0.0153|
|kmmlu_direct_aviation_engineering_and_maintenance      |      2|none  |     5|exact_match|↑  |0.4670|±  |0.0158|
|kmmlu_direct_biology                                   |      2|none  |     5|exact_match|↑  |0.3740|±  |0.0153|
|kmmlu_direct_chemical_engineering                      |      2|none  |     5|exact_match|↑  |0.4650|±  |0.0158|
|kmmlu_direct_chemistry                                 |      2|none  |     5|exact_match|↑  |0.4900|±  |0.0204|
|kmmlu_direct_civil_engineering                         |      2|none  |     5|exact_match|↑  |0.3540|±  |0.0151|
|kmmlu_direct_computer_science                          |      2|none  |     5|exact_match|↑  |0.7320|±  |0.0140|
|kmmlu_direct_construction                              |      2|none  |     5|exact_match|↑  |0.3590|±  |0.0152|
|kmmlu_direct_criminal_law                              |      2|none  |     5|exact_match|↑  |0.4250|±  |0.0350|
|kmmlu_direct_ecology                                   |      2|none  |     5|exact_match|↑  |0.4900|±  |0.0158|
|kmmlu_direct_economics                                 |      2|none  |     5|exact_match|↑  |0.6154|±  |0.0428|
|kmmlu_direct_education                                 |      2|none  |     5|exact_match|↑  |0.6900|±  |0.0465|
|kmmlu_direct_electrical_engineering                    |      2|none  |     5|exact_match|↑  |0.3170|±  |0.0147|
|kmmlu_direct_electronics_engineering                   |      2|none  |     5|exact_match|↑  |0.5440|±  |0.0158|
|kmmlu_direct_energy_management                         |      2|none  |     5|exact_match|↑  |0.3960|±  |0.0155|
|kmmlu_direct_environmental_science                     |      2|none  |     5|exact_match|↑  |0.2950|±  |0.0144|
|kmmlu_direct_fashion                                   |      2|none  |     5|exact_match|↑  |0.4660|±  |0.0158|
|kmmlu_direct_food_processing                           |      2|none  |     5|exact_match|↑  |0.4370|±  |0.0157|
|kmmlu_direct_gas_technology_and_engineering            |      2|none  |     5|exact_match|↑  |0.3650|±  |0.0152|
|kmmlu_direct_geomatics                                 |      2|none  |     5|exact_match|↑  |0.3770|±  |0.0153|
|kmmlu_direct_health                                    |      2|none  |     5|exact_match|↑  |0.6200|±  |0.0488|
|kmmlu_direct_industrial_engineer                       |      2|none  |     5|exact_match|↑  |0.4730|±  |0.0158|
|kmmlu_direct_information_technology                    |      2|none  |     5|exact_match|↑  |0.7080|±  |0.0144|
|kmmlu_direct_interior_architecture_and_design          |      2|none  |     5|exact_match|↑  |0.6080|±  |0.0154|
|kmmlu_direct_korean_history                            |      2|none  |     5|exact_match|↑  |0.3200|±  |0.0469|
|kmmlu_direct_law                                       |      2|none  |     5|exact_match|↑  |0.4730|±  |0.0158|
|kmmlu_direct_machine_design_and_manufacturing          |      2|none  |     5|exact_match|↑  |0.4750|±  |0.0158|
|kmmlu_direct_management                                |      2|none  |     5|exact_match|↑  |0.6160|±  |0.0154|
|kmmlu_direct_maritime_engineering                      |      2|none  |     5|exact_match|↑  |0.4817|±  |0.0204|
|kmmlu_direct_marketing                                 |      2|none  |     5|exact_match|↑  |0.8010|±  |0.0126|
|kmmlu_direct_materials_engineering                     |      2|none  |     5|exact_match|↑  |0.4970|±  |0.0158|
|kmmlu_direct_math                                      |      2|none  |     5|exact_match|↑  |0.3500|±  |0.0276|
|kmmlu_direct_mechanical_engineering                    |      2|none  |     5|exact_match|↑  |0.4040|±  |0.0155|
|kmmlu_direct_nondestructive_testing                    |      2|none  |     5|exact_match|↑  |0.4580|±  |0.0158|
|kmmlu_direct_patent                                    |      2|none  |     5|exact_match|↑  |0.4100|±  |0.0494|
|kmmlu_direct_political_science_and_sociology           |      2|none  |     5|exact_match|↑  |0.5500|±  |0.0288|
|kmmlu_direct_psychology                                |      2|none  |     5|exact_match|↑  |0.4700|±  |0.0158|
|kmmlu_direct_public_safety                             |      2|none  |     5|exact_match|↑  |0.3680|±  |0.0153|
|kmmlu_direct_railway_and_automotive_engineering        |      2|none  |     5|exact_match|↑  |0.3550|±  |0.0151|
|kmmlu_direct_real_estate                               |      2|none  |     5|exact_match|↑  |0.4650|±  |0.0354|
|kmmlu_direct_refrigerating_machinery                   |      2|none  |     5|exact_match|↑  |0.3730|±  |0.0153|
|kmmlu_direct_social_welfare                            |      2|none  |     5|exact_match|↑  |0.6140|±  |0.0154|
|kmmlu_direct_taxation                                  |      2|none  |     5|exact_match|↑  |0.4050|±  |0.0348|
|kmmlu_direct_telecommunications_and_wireless_technology|      2|none  |     5|exact_match|↑  |0.6080|±  |0.0154|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.6755|±  |0.0038|
| - humanities     |      2|none  |      |acc   |↑  |0.6140|±  |0.0067|
| - other          |      2|none  |      |acc   |↑  |0.7271|±  |0.0077|
| - social sciences|      2|none  |      |acc   |↑  |0.7793|±  |0.0073|
| - stem           |      2|none  |      |acc   |↑  |0.6153|±  |0.0084|