---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- smollm2
- smollm2-360m
- distillation
model-index:
- name: d-SmolLM2-360M
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 20.97
name: strict accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 4.76
name: normalized accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 0.23
name: exact match
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 0.45
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 7.76
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 1.88
name: accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=aloobun/d-SmolLM2-360M
name: Open LLM Leaderboard
---
This is a distillation experiment with SmolLM2-1.7B as the teacher and SmolLM2-360M as the student model.
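The card doesn't spell out the training recipe, so as a rough illustration only, here is a minimal sketch of standard logit distillation: the student is trained on a mix of the ordinary causal-LM loss and a KL term against temperature-softened teacher logits. The model IDs are the published SmolLM2 checkpoints; `temperature`, `alpha`, and the loss mix are illustrative assumptions, not the values used for this model.

```python
# Minimal logit-distillation sketch (illustrative, not this model's exact recipe).
# SmolLM2-1.7B and SmolLM2-360M share a tokenizer, so their vocabularies align.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B").eval()
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")

def distill_loss(input_ids, attention_mask, temperature=2.0, alpha=0.5):
    # temperature and alpha are assumed hyperparameters, not taken from the card.
    with torch.no_grad():
        t_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    out = student(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    # F.kl_div(log_q, p) computes KL(teacher || student) on softened distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(out.logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Blend the distillation term with the standard next-token loss.
    return alpha * kd + (1.0 - alpha) * out.loss
```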
**Eval** results using the SmolLM evaluation scripts (LightEval): the distilled model gained slightly over the base model on a few tasks, by small margins.
| Task | Version | Metric | **aloobun/d-SmolLM2-360M** Value | **HuggingFaceTB/SmolLM2-360M** Value |
|-----------------------|---------|----------|------------|----------|
| all | | acc_norm | **0.4653** | **0.4642** |
| | | qem | 0.0961 | 0.1004 |
| custom:arc:_average:0 | | acc_norm | 0.5303 | 0.5305 |
| custom:arc:challenge:0| 0 | acc_norm | 0.3771 | 0.3797 |
| custom:arc:easy:0 | 0 | acc_norm | 0.6835 | 0.6814 |
| custom:commonsense_qa:0| 0 | acc_norm | 0.3784 | 0.3759 |
| custom:gsm8k:5 | 0 | qem | 0.0326 | 0.0334 |
| custom:hellaswag:0 | 0 | acc_norm | 0.5418 | 0.5456 |
| custom:mmlu_pro:0 | 0 | acc_norm | 0.1127 | 0.1130 |
| custom:openbook_qa:0 | 0 | acc_norm | 0.3760 | 0.3720 |
| custom:piqa:0 | 0 | acc_norm | 0.7214 | 0.7220 |
| custom:trivia_qa:0 | 0 | qem | 0.1596 | 0.1675 |
| custom:winogrande:0 | 0 | acc_norm | 0.5312 | 0.5241 |
**Eval** results using lm-eval (lm-evaluation-harness): the distilled model slightly improves upon the base model on the following tasks (a reproduction sketch follows the table):
| Tasks |**HuggingFaceTB/SmolLM2-360M** Value|**aloobun/d-SmolLM2-360M** Value|
|----------------------------------------------------------|-------------:|-------------:|
| - leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| - leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| - leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| - leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| - leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| - leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| - leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| - leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| - leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| - leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
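Numbers in this style can be reproduced with lm-evaluation-harness; below is a minimal sketch assuming a recent (>= 0.4) release that ships the Open LLM Leaderboard task groups. The task list and batch size are illustrative; the harness version and flags actually used here aren't stated in the card.

```python
# Hedged reproduction sketch using lm-evaluation-harness' Python API.
# Task names assume the harness' bundled Open LLM Leaderboard groups.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aloobun/d-SmolLM2-360M",
    tasks=["leaderboard_bbh", "leaderboard_musr"],  # few-shot counts come from the task configs
    batch_size=8,
)
for task, metrics in sorted(results["results"].items()):
    print(task, metrics)
```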
Well, it didn't work as well as I'd hoped; I'll try again.
# Eval Results for aloobun/d-SmolLM2-360M (WIP)
## GPQA
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_gpqa | N/A| | | | | | | |
| - leaderboard_gpqa_diamond | 1|none | 0|acc_norm|↑ |0.2071|± |0.0289|
| - leaderboard_gpqa_extended| 1|none | 0|acc_norm|↑ |0.2308|± |0.0180|
| - leaderboard_gpqa_main | 1|none | 0|acc_norm|↑ |0.2679|± |0.0209|
## MUSR
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_musr | N/A| | | | | | | |
| - leaderboard_musr_murder_mysteries | 1|none | 0|acc_norm|↑ |0.5160|± |0.0317|
| - leaderboard_musr_object_placements| 1|none | 0|acc_norm|↑ |0.2383|± |0.0267|
| - leaderboard_musr_team_allocation | 1|none | 0|acc_norm|↑ |0.4400|± |0.0315|
## BBH
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------------------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh | N/A| | | | | | | |
| - leaderboard_bbh_boolean_expressions | 1|none | 3|acc_norm|↑ |0.5480|± |0.0315|
| - leaderboard_bbh_causal_judgement | 1|none | 3|acc_norm|↑ |0.4652|± |0.0366|
| - leaderboard_bbh_date_understanding | 1|none | 3|acc_norm|↑ |0.1560|± |0.0230|
| - leaderboard_bbh_disambiguation_qa | 1|none | 3|acc_norm|↑ |0.3120|± |0.0294|
| - leaderboard_bbh_formal_fallacies | 1|none | 3|acc_norm|↑ |0.5240|± |0.0316|
| - leaderboard_bbh_geometric_shapes | 1|none | 3|acc_norm|↑ |0.2040|± |0.0255|
| - leaderboard_bbh_hyperbaton | 1|none | 3|acc_norm|↑ |0.5000|± |0.0317|
| - leaderboard_bbh_logical_deduction_five_objects | 1|none | 3|acc_norm|↑ |0.2240|± |0.0264|
| - leaderboard_bbh_logical_deduction_seven_objects | 1|none | 3|acc_norm|↑ |0.1440|± |0.0222|
| - leaderboard_bbh_logical_deduction_three_objects | 1|none | 3|acc_norm|↑ |0.3320|± |0.0298|
| - leaderboard_bbh_movie_recommendation | 1|none | 3|acc_norm|↑ |0.2440|± |0.0272|
| - leaderboard_bbh_navigate | 1|none | 3|acc_norm|↑ |0.5800|± |0.0313|
| - leaderboard_bbh_object_counting | 1|none | 3|acc_norm|↑ |0.2080|± |0.0257|
| - leaderboard_bbh_penguins_in_a_table | 1|none | 3|acc_norm|↑ |0.2123|± |0.0340|
| - leaderboard_bbh_reasoning_about_colored_objects | 1|none | 3|acc_norm|↑ |0.1320|± |0.0215|
| - leaderboard_bbh_ruin_names | 1|none | 3|acc_norm|↑ |0.2480|± |0.0274|
| - leaderboard_bbh_salient_translation_error_detection | 1|none | 3|acc_norm|↑ |0.2120|± |0.0259|
| - leaderboard_bbh_snarks | 1|none | 3|acc_norm|↑ |0.5281|± |0.0375|
| - leaderboard_bbh_sports_understanding | 1|none | 3|acc_norm|↑ |0.4600|± |0.0316|
| - leaderboard_bbh_temporal_sequences | 1|none | 3|acc_norm|↑ |0.2800|± |0.0285|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1|none | 3|acc_norm|↑ |0.1720|± |0.0239|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects| 1|none | 3|acc_norm|↑ |0.1440|± |0.0222|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects| 1|none | 3|acc_norm|↑ |0.3000|± |0.0290|
| - leaderboard_bbh_web_of_lies | 1|none | 3|acc_norm|↑ |0.5480|± |0.0315|
## MMLU_PRO
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------------|------:|------|-----:|------|---|-----:|---|-----:|
|leaderboard_mmlu_pro| 0.1|none | 5|acc |↑ |0.1173|± |0.0029|
## IFEVAL
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------------|------:|------|-----:|-----------------------|---|-----:|---|------|
|leaderboard_ifeval| 3|none | 0|inst_level_loose_acc |↑ |0.2866|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.2770|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.1497|± |0.0154|
| | |none | 0|prompt_level_strict_acc|↑ |0.1423|± |0.0150|
## MATH HARD
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard | N/A| | | | | | | |
| - leaderboard_math_algebra_hard | 2|none | 4|exact_match|↑ |0.0033|± |0.0033|
| - leaderboard_math_counting_and_prob_hard | 2|none | 4|exact_match|↑ |0.0081|± |0.0081|
| - leaderboard_math_geometry_hard | 2|none | 4|exact_match|↑ |0.0000|± |0.0000|
| - leaderboard_math_intermediate_algebra_hard| 2|none | 4|exact_match|↑ |0.0000|± |0.0000|
| - leaderboard_math_num_theory_hard | 2|none | 4|exact_match|↑ |0.0065|± |0.0065|
| - leaderboard_math_prealgebra_hard | 2|none | 4|exact_match|↑ |0.0104|± |0.0073|
| - leaderboard_math_precalculus_hard | 2|none | 4|exact_match|↑ |0.0000|± |0.0000|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_aloobun__d-SmolLM2-360M)
| Metric |Value|
|-------------------|----:|
|Avg. | 6.01|
|IFEval (0-Shot) |20.97|
|BBH (3-Shot) | 4.76|
|MATH Lvl 5 (4-Shot)| 0.23|
|GPQA (0-shot) | 0.45|
|MuSR (0-shot) | 7.76|
|MMLU-PRO (5-shot) | 1.88|