---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- smollm2
- smollm2-360m
- distillation
model-index:
- name: d-SmolLM2-360M
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 20.97
name: strict accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 4.76
name: normalized accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 0.23
name: exact match
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 0.45
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 7.76
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 1.88
name: accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
---
This is a distillation experiment with SmolLM2-1.7B as the teacher and SmolLM2-360M as the student model.
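The card doesn't spell out the training recipe, so as a rough illustration only, here is a minimal sketch of standard logit distillation: the student is trained on a mix of the ordinary causal-LM loss and a KL term against temperature-softened teacher logits. The model IDs are the published SmolLM2 checkpoints; `temperature`, `alpha`, and the loss mix are illustrative assumptions, not the values used for this model.

```python
# Minimal logit-distillation sketch (illustrative, not this model's exact recipe).
# SmolLM2-1.7B and SmolLM2-360M share a tokenizer, so their vocabularies align.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B").eval()
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")

def distill_loss(input_ids, attention_mask, temperature=2.0, alpha=0.5):
    # temperature and alpha are assumed hyperparameters, not taken from the card.
    with torch.no_grad():
        t_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    out = student(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    # F.kl_div(log_q, p) computes KL(teacher || student) on softened distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(out.logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Blend the distillation term with the standard next-token loss.
    return alpha * kd + (1.0 - alpha) * out.loss
```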
**Eval** results using the SmolLM evaluation scripts (LightEval): the distilled model gained slightly over the base model on a few tasks, by small margins.
| Task | Version | Metric | **aloobun/d-SmolLM2-360M** Value | **HuggingFaceTB/SmolLM2-360M** Value |
|-----------------------|---------|----------|------------|----------|
| all | | acc_norm | **0.4653** | **0.4642** |
| | | qem | 0.0961 | 0.1004 |
| custom:arc:_average:0 | | acc_norm | 0.5303 | 0.5305 |
| custom:arc:challenge:0| 0 | acc_norm | 0.3771 | 0.3797 |
| custom:arc:easy:0 | 0 | acc_norm | 0.6835 | 0.6814 |
| custom:commonsense_qa:0| 0 | acc_norm | 0.3784 | 0.3759 |
| custom:gsm8k:5 | 0 | qem | 0.0326 | 0.0334 |
| custom:hellaswag:0 | 0 | acc_norm | 0.5418 | 0.5456 |
| custom:mmlu_pro:0 | 0 | acc_norm | 0.1127 | 0.1130 |
| custom:openbook_qa:0 | 0 | acc_norm | 0.3760 | 0.3720 |
| custom:piqa:0 | 0 | acc_norm | 0.7214 | 0.7220 |
| custom:trivia_qa:0 | 0 | qem | 0.1596 | 0.1675 |
| custom:winogrande:0 | 0 | acc_norm | 0.5312 | 0.5241 |
**Eval** results using lm-eval (lm-evaluation-harness): the distilled model slightly improves upon the base model on the following tasks (a reproduction sketch follows the table):
| Tasks |**HuggingFaceTB/SmolLM2-360M** Value|**aloobun/d-SmolLM2-360M** Value|
|----------------------------------------------------------|-------------:|-------------:|
| - leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| - leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| - leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| - leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| - leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| - leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| - leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| - leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| - leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| - leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
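Numbers in this style can be reproduced with lm-evaluation-harness; below is a minimal sketch assuming a recent (>= 0.4) release that ships the Open LLM Leaderboard task groups. The task list and batch size are illustrative; the harness version and flags actually used here aren't stated in the card.

```python
# Hedged reproduction sketch using lm-evaluation-harness' Python API.
# Task names assume the harness' bundled Open LLM Leaderboard groups.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aloobun/d-SmolLM2-360M",
    tasks=["leaderboard_bbh", "leaderboard_musr"],  # few-shot counts come from the task configs
    batch_size=8,
)
for task, metrics in sorted(results["results"].items()):
    print(task, metrics)
```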
Well, it didn't work as well as I'd hoped; I'll try again.
# Eval Results for aloobun/d-SmolLM2-360M (WIP)
## GPQA
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_gpqa | N/A| | | | | | | |
| - leaderboard_gpqa_diamond | 1|none | 0|acc_norm|↑ |0.2071|± |0.0289|
| - leaderboard_gpqa_extended| 1|none | 0|acc_norm|↑ |0.2308|± |0.0180|
| - leaderboard_gpqa_main | 1|none | 0|acc_norm|↑ |0.2679|± |0.0209|
## MUSR
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_musr | N/A| | | | | | | |
| - leaderboard_musr_murder_mysteries | 1|none | 0|acc_norm|↑ |0.5160|± |0.0317|
| - leaderboard_musr_object_placements| 1|none | 0|acc_norm|↑ |0.2383|± |0.0267|
| - leaderboard_musr_team_allocation | 1|none | 0|acc_norm|↑ |0.4400|± |0.0315|
## BBH
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh | N/A| | | | | | | |
| - leaderboard_bbh_boolean_expressions | 1|none | 3|acc_norm|↑ |0.5480|± |0.0315|
| - leaderboard_bbh_causal_judgement | 1|none | 3|acc_norm|↑ |0.4652|± |0.0366|
| - leaderboard_bbh_date_understanding | 1|none | 3|acc_norm|↑ |0.1560|± |0.0230|
| - leaderboard_bbh_disambiguation_qa | 1|none | 3|acc_norm|↑ |0.3120|± |0.0294|
| - leaderboard_bbh_formal_fallacies | 1|none | 3|acc_norm|↑ |0.5240|± |0.0316|
| - leaderboard_bbh_geometric_shapes | 1|none | 3|acc_norm|↑ |0.2040|± |0.0255|
| - leaderboard_bbh_hyperbaton | 1|none | 3|acc_norm|↑ |0.5000|± |0.0317|
| - leaderboard_bbh_logical_deduction_five_objects | 1|none | 3|acc_norm|↑ |0.2240|± |0.0264|
| - leaderboard_bbh_logical_deduction_seven_objects | 1|none | 3|acc_norm|↑ |0.1440|± |0.0222|
| - leaderboard_bbh_logical_deduction_three_objects | 1|none | 3|acc_norm|↑ |0.3320|± |0.0298|
| - leaderboard_bbh_movie_recommendation | 1|none | 3|acc_norm|↑ |0.2440|± |0.0272|
| - leaderboard_bbh_navigate | 1|none | 3|acc_norm|↑ |0.5800|± |0.0313|
| - leaderboard_bbh_object_counting | 1|none | 3|acc_norm|↑ |0.2080|± |0.0257|
| - leaderboard_bbh_penguins_in_a_table | 1|none | 3|acc_norm|↑ |0.2123|± |0.0340|
| - leaderboard_bbh_reasoning_about_colored_objects | 1|none | 3|acc_norm|↑ |0.1320|± |0.0215|
| - leaderboard_bbh_ruin_names | 1|none | 3|acc_norm|↑ |0.2480|± |0.0274|
| - leaderboard_bbh_salient_translation_error_detection | 1|none | 3|acc_norm|↑ |0.2120|± |0.0259|
| - leaderboard_bbh_snarks | 1|none | 3|acc_norm|↑ |0.5281|± |0.0375|
| - leaderboard_bbh_sports_understanding | 1|none | 3|acc_norm|↑ |0.4600|± |0.0316|
| - leaderboard_bbh_temporal_sequences | 1|none | 3|acc_norm|↑ |0.2800|± |0.0285|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1|none | 3|acc_norm|↑ |0.1720|± |0.0239|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects| 1|none | 3|acc_norm|↑ |0.1440|± |0.0222|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects| 1|none | 3|acc_norm|↑ |0.3000|± |0.0290|
| - leaderboard_bbh_web_of_lies | 1|none | 3|acc_norm|↑ |0.5480|± |0.0315|
## MMLU_PRO
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------------|------:|------|-----:|------|---|-----:|---|-----:|
|leaderboard_mmlu_pro| 0.1|none | 5|acc |↑ |0.1173|± |0.0029|
## IFEVAL
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|leaderboard_ifeval| 3|none | 0|inst_level_loose_acc |↑ |0.2866|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.2770|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.1497|± |0.0154|
| | |none | 0|prompt_level_strict_acc|↑ |0.1423|± |0.0150|
## MATH HARD
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard | N/A| | | | | | | |
| - leaderboard_math_algebra_hard | 2|none | 4|exact_match|↑ |0.0033|± |0.0033|
| - leaderboard_math_counting_and_prob_hard | 2|none | 4|exact_match|↑ |0.0081|± |0.0081|
| - leaderboard_math_geometry_hard | 2|none | 4|exact_match|↑ |0.0000|± |0.0000|
| - leaderboard_math_intermediate_algebra_hard| 2|none | 4|exact_match|↑ |0.0000|± |0.0000|
| - leaderboard_math_num_theory_hard | 2|none | 4|exact_match|↑ |0.0065|± |0.0065|
| - leaderboard_math_prealgebra_hard | 2|none | 4|exact_match|↑ |0.0104|± |0.0073|
| - leaderboard_math_precalculus_hard | 2|none | 4|exact_match|↑ |0.0000|± |0.0000|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_aloobun__d-SmolLM2-360M)
| Metric |Value|
|-------------------|----:|
|Avg. | 6.01|
|IFEval (0-Shot) |20.97|
|BBH (3-Shot) | 4.76|
|MATH Lvl 5 (4-Shot)| 0.23|
|GPQA (0-shot) | 0.45|
|MuSR (0-shot) | 7.76|
|MMLU-PRO (5-shot) | 1.88|