|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8029|± |0.0110|
| | |strict-match | 5|exact_match|↑ |0.7961|± |0.0111|

| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------|------:|------|-----:|--------|---|-----:|---|-----:|
|kobest_boolq | 1|none | 5|acc |↑ |0.9167|± |0.0074|
| | |none | 5|f1 |↑ |0.9167|± | N/A|
|kobest_copa | 1|none | 5|acc |↑ |0.7130|± |0.0143|
| | |none | 5|f1 |↑ |0.7125|± | N/A|
|kobest_hellaswag| 1|none | 5|acc |↑ |0.4540|± |0.0223|
| | |none | 5|acc_norm|↑ |0.5700|± |0.0222|
| | |none | 5|f1 |↑ |0.4505|± | N/A|
|kobest_sentineg | 1|none | 5|acc |↑ |0.9496|± |0.0110|
| | |none | 5|f1 |↑ |0.9496|± | N/A|
|kobest_wic | 1|none | 5|acc |↑ |0.7111|± |0.0128|
| | |none | 5|f1 |↑ |0.7025|± | N/A|

| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------------------------------------------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|kmmlu_direct_accounting | 2|none | 5|exact_match|↑ |0.5500|± |0.0500|
|kmmlu_direct_agricultural_sciences | 2|none | 5|exact_match|↑ |0.3680|± |0.0153|
|kmmlu_direct_aviation_engineering_and_maintenance | 2|none | 5|exact_match|↑ |0.4670|± |0.0158|
|kmmlu_direct_biology | 2|none | 5|exact_match|↑ |0.3740|± |0.0153|
|kmmlu_direct_chemical_engineering | 2|none | 5|exact_match|↑ |0.4650|± |0.0158|
|kmmlu_direct_chemistry | 2|none | 5|exact_match|↑ |0.4900|± |0.0204|
|kmmlu_direct_civil_engineering | 2|none | 5|exact_match|↑ |0.3540|± |0.0151|
|kmmlu_direct_computer_science | 2|none | 5|exact_match|↑ |0.7320|± |0.0140|
|kmmlu_direct_construction | 2|none | 5|exact_match|↑ |0.3590|± |0.0152|
|kmmlu_direct_criminal_law | 2|none | 5|exact_match|↑ |0.4250|± |0.0350|
|kmmlu_direct_ecology | 2|none | 5|exact_match|↑ |0.4900|± |0.0158|
|kmmlu_direct_economics | 2|none | 5|exact_match|↑ |0.6154|± |0.0428|
|kmmlu_direct_education | 2|none | 5|exact_match|↑ |0.6900|± |0.0465|
|kmmlu_direct_electrical_engineering | 2|none | 5|exact_match|↑ |0.3170|± |0.0147|
|kmmlu_direct_electronics_engineering | 2|none | 5|exact_match|↑ |0.5440|± |0.0158|
|kmmlu_direct_energy_management | 2|none | 5|exact_match|↑ |0.3960|± |0.0155|
|kmmlu_direct_environmental_science | 2|none | 5|exact_match|↑ |0.2950|± |0.0144|
|kmmlu_direct_fashion | 2|none | 5|exact_match|↑ |0.4660|± |0.0158|
|kmmlu_direct_food_processing | 2|none | 5|exact_match|↑ |0.4370|± |0.0157|
|kmmlu_direct_gas_technology_and_engineering | 2|none | 5|exact_match|↑ |0.3650|± |0.0152|
|kmmlu_direct_geomatics | 2|none | 5|exact_match|↑ |0.3770|± |0.0153|
|kmmlu_direct_health | 2|none | 5|exact_match|↑ |0.6200|± |0.0488|
|kmmlu_direct_industrial_engineer | 2|none | 5|exact_match|↑ |0.4730|± |0.0158|
|kmmlu_direct_information_technology | 2|none | 5|exact_match|↑ |0.7080|± |0.0144|
|kmmlu_direct_interior_architecture_and_design | 2|none | 5|exact_match|↑ |0.6080|± |0.0154|
|kmmlu_direct_korean_history | 2|none | 5|exact_match|↑ |0.3200|± |0.0469|
|kmmlu_direct_law | 2|none | 5|exact_match|↑ |0.4730|± |0.0158|
|kmmlu_direct_machine_design_and_manufacturing | 2|none | 5|exact_match|↑ |0.4750|± |0.0158|
|kmmlu_direct_management | 2|none | 5|exact_match|↑ |0.6160|± |0.0154|
|kmmlu_direct_maritime_engineering | 2|none | 5|exact_match|↑ |0.4817|± |0.0204|
|kmmlu_direct_marketing | 2|none | 5|exact_match|↑ |0.8010|± |0.0126|
|kmmlu_direct_materials_engineering | 2|none | 5|exact_match|↑ |0.4970|± |0.0158|
|kmmlu_direct_math | 2|none | 5|exact_match|↑ |0.3500|± |0.0276|
|kmmlu_direct_mechanical_engineering | 2|none | 5|exact_match|↑ |0.4040|± |0.0155|
|kmmlu_direct_nondestructive_testing | 2|none | 5|exact_match|↑ |0.4580|± |0.0158|
|kmmlu_direct_patent | 2|none | 5|exact_match|↑ |0.4100|± |0.0494|
|kmmlu_direct_political_science_and_sociology | 2|none | 5|exact_match|↑ |0.5500|± |0.0288|
|kmmlu_direct_psychology | 2|none | 5|exact_match|↑ |0.4700|± |0.0158|
|kmmlu_direct_public_safety | 2|none | 5|exact_match|↑ |0.3680|± |0.0153|
|kmmlu_direct_railway_and_automotive_engineering | 2|none | 5|exact_match|↑ |0.3550|± |0.0151|
|kmmlu_direct_real_estate | 2|none | 5|exact_match|↑ |0.4650|± |0.0354|
|kmmlu_direct_refrigerating_machinery | 2|none | 5|exact_match|↑ |0.3730|± |0.0153|
|kmmlu_direct_social_welfare | 2|none | 5|exact_match|↑ |0.6140|± |0.0154|
|kmmlu_direct_taxation | 2|none | 5|exact_match|↑ |0.4050|± |0.0348|
|kmmlu_direct_telecommunications_and_wireless_technology| 2|none | 5|exact_match|↑ |0.6080|± |0.0154|

| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.6755|± |0.0038|
| - humanities | 2|none | |acc |↑ |0.6140|± |0.0067|
| - other | 2|none | |acc |↑ |0.7271|± |0.0077|
| - social sciences| 2|none | |acc |↑ |0.7793|± |0.0073|
| - stem | 2|none | |acc |↑ |0.6153|± |0.0084|
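
Each row above reports a point estimate (`Value`) and its standard error (`Stderr`). One rough way to read the pair is as an approximate 95% interval, `Value ± 1.96 × Stderr`. The sketch below assumes a normal approximation, which is not stated anywhere in these results; the helper name `approx_ci95` is hypothetical, and rows with `N/A` in the `Stderr` column cannot be treated this way.

```python
def approx_ci95(value: float, stderr: float) -> tuple[float, float]:
    """Approximate 95% interval, value +/- 1.96 * stderr, clipped to [0, 1].

    Assumes the score is roughly normally distributed around `value`;
    this is an illustrative assumption, not part of the reported results.
    """
    z = 1.96  # two-sided 95% quantile of the standard normal
    low = max(0.0, value - z * stderr)
    high = min(1.0, value + z * stderr)
    return low, high


# Inputs copied from the gsm8k flexible-extract row: Value 0.8029, Stderr 0.0110
low, high = approx_ci95(0.8029, 0.0110)
print(f"gsm8k exact_match (flexible-extract): {low:.4f} to {high:.4f}")
```

Intervals like this are most useful when comparing two rows or two models: if the intervals overlap heavily, the difference in `Value` may not be meaningful.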