|
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- smollm2
- smollm2-360m
- distillation
model-index:
- name: d-SmolLM2-360M
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 20.97
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 4.76
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 0.23
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 0.45
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 7.76
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 1.88
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
      name: Open LLM Leaderboard
---
|
|
|
This is a distillation experiment with SmolLM2-1.7B as the teacher and SmolLM2-360M as the student model.
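
For illustration, here is a minimal sketch of what logit-level distillation between these two models can look like, assuming a standard temperature-scaled KL objective. The actual training recipe, temperature, and loss weighting used for this model were not published, so all hyperparameters below are hypothetical.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical hyperparameters; the values actually used for d-SmolLM2-360M are unknown.
TEMPERATURE = 2.0  # softens both distributions before the KL term
ALPHA = 0.5        # blend between distillation loss and plain LM loss

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
teacher.eval()  # the teacher stays frozen; only the student is trained

def distillation_loss(texts: list[str]) -> torch.Tensor:
    """Combined KD + LM loss for one batch of raw text."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        teacher_logits = teacher(**inputs).logits
    # Real training code would mask padding positions in the labels with -100.
    out = student(**inputs, labels=inputs["input_ids"])
    # KL divergence between temperature-softened teacher and student next-token
    # distributions; both models share the SmolLM2 tokenizer, so the vocabularies
    # line up token for token.
    kd = F.kl_div(
        F.log_softmax(out.logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE**2
    return ALPHA * kd + (1 - ALPHA) * out.loss
```

Because teacher and student share a tokenizer, the student can learn from the teacher's full next-token distribution rather than only from sampled text.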
|
|
|
**Eval** results using the SmolLM evaluation scripts (LightEval):

The distilled model gains slightly over the base model on a few tasks, though only by small margins.
|
|
|
| Task                    | Version | Metric   | **aloobun/d-SmolLM2-360M** Value | **HuggingFaceTB/SmolLM2-360M** Value |
|-------------------------|---------|----------|----------------------------------|--------------------------------------|
| all                     |         | acc_norm | **0.4653**                       | **0.4642**                           |
|                         |         | qem      | 0.0961                           | 0.1004                               |
| custom:arc:_average:0   |         | acc_norm | 0.5303                           | 0.5305                               |
| custom:arc:challenge:0  | 0       | acc_norm | 0.3771                           | 0.3797                               |
| custom:arc:easy:0       | 0       | acc_norm | 0.6835                           | 0.6814                               |
| custom:commonsense_qa:0 | 0       | acc_norm | 0.3784                           | 0.3759                               |
| custom:gsm8k:5          | 0       | qem      | 0.0326                           | 0.0334                               |
| custom:hellaswag:0      | 0       | acc_norm | 0.5418                           | 0.5456                               |
| custom:mmlu_pro:0       | 0       | acc_norm | 0.1127                           | 0.1130                               |
| custom:openbook_qa:0    | 0       | acc_norm | 0.3760                           | 0.3720                               |
| custom:piqa:0           | 0       | acc_norm | 0.7214                           | 0.7220                               |
| custom:trivia_qa:0      | 0       | qem      | 0.1596                           | 0.1675                               |
| custom:winogrande:0     | 0       | acc_norm | 0.5312                           | 0.5241                               |
|
|
|
|
|
|
|
**Eval** results using the lm-eval evaluation scripts:

The distilled model slightly improves on the base model in the following tasks (a reproduction sketch follows the table):
|
|
|
| Tasks                                                    |**HuggingFaceTB/SmolLM2-360M** Value|**aloobun/d-SmolLM2-360M** Value|
|----------------------------------------------------------|-------------:|-------------:|
| - leaderboard_bbh_causal_judgement                       | 0.4545 | 0.4652 |
| - leaderboard_bbh_geometric_shapes                       | 0.1680 | 0.2040 |
| - leaderboard_bbh_movie_recommendation                   | 0.2120 | 0.2440 |
| - leaderboard_bbh_penguins_in_a_table                    | 0.2055 | 0.2123 |
| - leaderboard_bbh_reasoning_about_colored_objects        | 0.1160 | 0.1320 |
| - leaderboard_bbh_ruin_names                             | 0.2360 | 0.2480 |
| - leaderboard_bbh_salient_translation_error_detection    | 0.1480 | 0.2120 |
| - leaderboard_bbh_snarks                                 | 0.5169 | 0.5281 |
| - leaderboard_bbh_temporal_sequences                     | 0.2720 | 0.2800 |
| - leaderboard_musr_murder_mysteries                      | 0.5040 | 0.5160 |
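
The per-task numbers above (and the detailed tables below) use the `leaderboard_*` task group from lm-evaluation-harness. A minimal reproduction sketch, assuming lm-eval >= 0.4 with the leaderboard tasks available:

```python
# pip install "lm-eval>=0.4"
import lm_eval

# Re-run two of the subtasks listed above for the distilled model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aloobun/d-SmolLM2-360M,dtype=bfloat16",
    tasks=["leaderboard_bbh_causal_judgement", "leaderboard_musr_murder_mysteries"],
    batch_size=8,
)
print(results["results"])  # per-task metrics (acc_norm, stderr, ...)
```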
|
|
|
|
|
Well, it didn’t work as well as I hoped; I’ll try again.
|
|
|
|
|
# Eval Results for aloobun/d-SmolLM2-360M (WIP)
|
|
|
|
|
## GPQA |
|
|
|
|
|
| Tasks                       |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_gpqa             |    N/A|      |      |        |   |      |   |      |
| - leaderboard_gpqa_diamond  |      1|none  |     0|acc_norm|↑  |0.2071|±  |0.0289|
| - leaderboard_gpqa_extended |      1|none  |     0|acc_norm|↑  |0.2308|±  |0.0180|
| - leaderboard_gpqa_main     |      1|none  |     0|acc_norm|↑  |0.2679|±  |0.0209|
|
|
|
## MUSR |
|
|
|
| Tasks                                |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_musr                      |    N/A|      |      |        |   |      |   |      |
| - leaderboard_musr_murder_mysteries  |      1|none  |     0|acc_norm|↑  |0.5160|±  |0.0317|
| - leaderboard_musr_object_placements |      1|none  |     0|acc_norm|↑  |0.2383|±  |0.0267|
| - leaderboard_musr_team_allocation   |      1|none  |     0|acc_norm|↑  |0.4400|±  |0.0315|
|
|
|
|
|
## BBH |
|
|
|
| Tasks                                                     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----------------------------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh                                            |    N/A|      |      |        |   |      |   |      |
| - leaderboard_bbh_boolean_expressions                     |      1|none  |     3|acc_norm|↑  |0.5480|±  |0.0315|
| - leaderboard_bbh_causal_judgement                        |      1|none  |     3|acc_norm|↑  |0.4652|±  |0.0366|
| - leaderboard_bbh_date_understanding                      |      1|none  |     3|acc_norm|↑  |0.1560|±  |0.0230|
| - leaderboard_bbh_disambiguation_qa                       |      1|none  |     3|acc_norm|↑  |0.3120|±  |0.0294|
| - leaderboard_bbh_formal_fallacies                        |      1|none  |     3|acc_norm|↑  |0.5240|±  |0.0316|
| - leaderboard_bbh_geometric_shapes                        |      1|none  |     3|acc_norm|↑  |0.2040|±  |0.0255|
| - leaderboard_bbh_hyperbaton                              |      1|none  |     3|acc_norm|↑  |0.5000|±  |0.0317|
| - leaderboard_bbh_logical_deduction_five_objects          |      1|none  |     3|acc_norm|↑  |0.2240|±  |0.0264|
| - leaderboard_bbh_logical_deduction_seven_objects         |      1|none  |     3|acc_norm|↑  |0.1440|±  |0.0222|
| - leaderboard_bbh_logical_deduction_three_objects         |      1|none  |     3|acc_norm|↑  |0.3320|±  |0.0298|
| - leaderboard_bbh_movie_recommendation                    |      1|none  |     3|acc_norm|↑  |0.2440|±  |0.0272|
| - leaderboard_bbh_navigate                                |      1|none  |     3|acc_norm|↑  |0.5800|±  |0.0313|
| - leaderboard_bbh_object_counting                         |      1|none  |     3|acc_norm|↑  |0.2080|±  |0.0257|
| - leaderboard_bbh_penguins_in_a_table                     |      1|none  |     3|acc_norm|↑  |0.2123|±  |0.0340|
| - leaderboard_bbh_reasoning_about_colored_objects         |      1|none  |     3|acc_norm|↑  |0.1320|±  |0.0215|
| - leaderboard_bbh_ruin_names                              |      1|none  |     3|acc_norm|↑  |0.2480|±  |0.0274|
| - leaderboard_bbh_salient_translation_error_detection     |      1|none  |     3|acc_norm|↑  |0.2120|±  |0.0259|
| - leaderboard_bbh_snarks                                  |      1|none  |     3|acc_norm|↑  |0.5281|±  |0.0375|
| - leaderboard_bbh_sports_understanding                    |      1|none  |     3|acc_norm|↑  |0.4600|±  |0.0316|
| - leaderboard_bbh_temporal_sequences                      |      1|none  |     3|acc_norm|↑  |0.2800|±  |0.0285|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects  |      1|none  |     3|acc_norm|↑  |0.1720|±  |0.0239|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects |      1|none  |     3|acc_norm|↑  |0.1440|±  |0.0222|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects |      1|none  |     3|acc_norm|↑  |0.3000|±  |0.0290|
| - leaderboard_bbh_web_of_lies                             |      1|none  |     3|acc_norm|↑  |0.5480|±  |0.0315|
|
|
|
|
|
## MMLU_PRO |
|
|
|
| Tasks               |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------|------:|------|-----:|------|---|-----:|---|-----:|
|leaderboard_mmlu_pro |    0.1|none  |     5|acc   |↑  |0.1173|±  |0.0029|
|
|
|
## IFEVAL |
|
| Tasks            |Version|Filter|n-shot| Metric                |   |Value |   |Stderr|
|------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|leaderboard_ifeval|      3|none  |     0|inst_level_loose_acc   |↑  |0.2866|±  |   N/A|
|                  |       |none  |     0|inst_level_strict_acc  |↑  |0.2770|±  |   N/A|
|                  |       |none  |     0|prompt_level_loose_acc |↑  |0.1497|±  |0.0154|
|                  |       |none  |     0|prompt_level_strict_acc|↑  |0.1423|±  |0.0150|
|
|
|
## MATH HARD |
|
| Tasks                                        |Version|Filter|n-shot| Metric    |   |Value |   |Stderr|
|----------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard                         |    N/A|      |      |           |   |      |   |      |
| - leaderboard_math_algebra_hard              |      2|none  |     4|exact_match|↑  |0.0033|±  |0.0033|
| - leaderboard_math_counting_and_prob_hard    |      2|none  |     4|exact_match|↑  |0.0081|±  |0.0081|
| - leaderboard_math_geometry_hard             |      2|none  |     4|exact_match|↑  |0.0000|±  |0.0000|
| - leaderboard_math_intermediate_algebra_hard |      2|none  |     4|exact_match|↑  |0.0000|±  |0.0000|
| - leaderboard_math_num_theory_hard           |      2|none  |     4|exact_match|↑  |0.0065|±  |0.0065|
| - leaderboard_math_prealgebra_hard           |      2|none  |     4|exact_match|↑  |0.0104|±  |0.0073|
| - leaderboard_math_precalculus_hard          |      2|none  |     4|exact_match|↑  |0.0000|±  |0.0000|
|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_aloobun__d-SmolLM2-360M).
|
|
|
| Metric             |Value|
|--------------------|----:|
| Avg.               | 6.01|
| IFEval (0-Shot)    |20.97|
| BBH (3-Shot)       | 4.76|
| MATH Lvl 5 (4-Shot)| 0.23|
| GPQA (0-shot)      | 0.45|
| MuSR (0-shot)      | 7.76|
| MMLU-PRO (5-shot)  | 1.88|
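
For reference, Avg. is the arithmetic mean of the six scores above: (20.97 + 4.76 + 0.23 + 0.45 + 7.76 + 1.88) / 6 ≈ 6.01. These leaderboard values are baseline-normalized, which is why they differ from the raw acc_norm and exact_match numbers in the tables above.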
|
|
|
|