distily_TinyStories-33M

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the Eval-Phase Metrics table, not the final step):

eval_enwikippl: 5505.2720
eval_frwikippl: 21773.6699
eval_zhwikippl: 149216.0938
eval_loss: 1.1383
eval_runtime: 51.1413
eval_samples_per_second: 48.884
eval_steps_per_second: 6.12
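
As a rough sanity check, the runtime and throughput figures above can be multiplied back together to estimate how much work one evaluation pass covers. This is a minimal sketch; the variable names are illustrative, and the resulting sample and step counts are estimates derived from the reported numbers, not values stated in this card.

```python
# Values copied from the evaluation results above.
eval_runtime = 51.1413            # seconds
eval_samples_per_second = 48.884
eval_steps_per_second = 6.12

# Throughput multiplied by wall-clock time approximates the workload size.
approx_samples = eval_runtime * eval_samples_per_second
approx_steps = eval_runtime * eval_steps_per_second

# Samples handled per eval step; this should sit close to the eval batch size.
samples_per_step = eval_samples_per_second / eval_steps_per_second

print(round(approx_samples), round(approx_steps), round(samples_per_step, 2))
```

The per-step figure comes out just under 8, which is consistent with the eval_batch_size of 8 listed under the training hyperparameters if the last batch is only partially filled.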

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=5000.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=500.0, loss_fn=jsd, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 4e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
num_epochs: 1.0
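
The distillation_objective above names three loss components: KL divergence on logits (weight 1), MSE on hidden states (weight 5000.0), and Jensen-Shannon divergence on attention (weight 500.0). The sketch below illustrates one plausible way such components could be combined, using only the Python standard library on toy vectors. It is not Distily's implementation: the weighted sum, the KL direction (teacher relative to student), the mean reduction, and all function and key names are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric average of KL to the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher, student, w_logits=1.0, w_hs=5000.0, w_attn=500.0):
    """Hypothetical weighted sum of the three components (assumed combination)."""
    logits_term = kl_div(softmax(teacher["logits"]), softmax(student["logits"]))
    hs_term = mse(teacher["hs"], student["hs"])
    attn_term = jsd(teacher["attn"], student["attn"])
    return w_logits * logits_term + w_hs * hs_term + w_attn * attn_term

# Toy inputs, invented for illustration only.
teacher = {"logits": [2.0, 0.5, -1.0], "hs": [0.1, -0.2, 0.3], "attn": [0.7, 0.2, 0.1]}
student = {"logits": [1.5, 0.7, -0.8], "hs": [0.12, -0.18, 0.25], "attn": [0.6, 0.3, 0.1]}

print(distillation_loss(teacher, student))
```

The large weights on the hidden-state and attention terms are taken from the hyperparameters above. Because MSE on small activations and JSD on attention distributions tend to produce small raw values, weights of this size would keep those terms from being dominated by the logits term, though the card does not state the author's reasoning.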

Resource Usage

Peak GPU Memory: 8.2949 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	zhwikippl
teacher eval		20633.1680	131577.2812					7615.4468
0	0	57409.7656	57878.0820	11.7972	40.6672	61.475	7.697	56928.0781
1000	0.0323	10372.9512	76930.4531	1.9053	41.7953	59.815	7.489	858113.625
2000	0.0646	8020.6040	46711.9688	1.6472	41.0642	60.88	7.622	367518.3125
3000	0.0970	8157.5376	45240.3945	1.5278	45.4508	55.005	6.887	515510.5625
4000	0.1293	7411.5596	36822.6484	1.4337	51.1158	48.909	6.123	421034.4688
5000	0.1616	6422.7583	28339.4023	1.3515	51.1748	48.852	6.116	267027.4375
6000	0.1939	6131.3276	24695.6113	1.2750	50.9734	49.045	6.14	194273.2656
7000	0.2263	5802.4341	23374.1562	1.2199	50.8571	49.157	6.155	168406.4688
8000	0.2586	5621.9170	21168.1855	1.1773	51.0097	49.01	6.136	164012.0469
9000	0.2909	5505.2720	21773.6699	1.1383	51.1413	48.884	6.12	149216.0938
10000	0.3232	5617.5493	21623.7461	1.1134	51.0853	48.938	6.127	148977.0625
11000	0.3555	5438.9810	21305.9277	1.0901	51.2289	48.801	6.11	148262.7188
12000	0.3879	5601.4360	22292.5059	1.0718	51.1771	48.85	6.116	156941.4062
13000	0.4202	5323.2368	21323.9785	1.0547	50.814	49.199	6.16	145089.7812
14000	0.4525	5399.0068	21468.7930	1.0443	50.9066	49.11	6.149	147118.75
15000	0.4848	5341.0449	20151.6465	1.0364	51.0013	49.018	6.137	134312.3438
16000	0.5172	5234.6987	20021.3477	1.0292	51.7235	48.334	6.051	136299.75
17000	0.5495	5317.8687	21308.9355	1.0156	54.7044	45.7	5.722	149495.2656
18000	0.5818	5521.5405	20827.6855	1.0137	41.4159	60.363	7.557	141984.7344
19000	0.6141	5249.7568	20254.2051	1.0055	42.1847	59.263	7.42	124202.625
20000	0.6465	5582.7598	21764.4727	0.9982	46.3033	53.992	6.76	149495.2656
21000	0.6788	5232.6621	20262.7637	0.9935	48.1287	51.944	6.503	145128.5312
22000	0.7111	5320.3491	21332.9902	0.9854	50.6681	49.341	6.177	155605.7656
23000	0.7434	5032.2212	19788.3945	0.9876	50.9899	49.029	6.138	141417.0312
24000	0.7757	5318.2793	22064.2031	0.9832	50.912	49.104	6.148	152560.7188
25000	0.8081	5365.5708	21906.0957	0.9779	51.1379	48.887	6.121	154034.5156
26000	0.8404	5328.6157	22267.3691	0.9740	51.1115	48.913	6.124	154983.75
27000	0.8727	5565.8813	22663.3496	0.9714	32.781	76.264	9.548	152397.8594
28000	0.9050	5278.7847	20380.2637	0.9723	27.108	92.224	11.546	141190.6406
29000	0.9374	5302.2002	20637.6562	0.9657	30.8728	80.977	10.138	139914.2969
30000	0.9697	5366.4053	22920.4629	0.9633	27.0433	92.444	11.574	160202.3281
30938	1.0	5286.9868	20498.4277	0.9628	27.0346	92.474	11.578	145051.0469
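
The epoch column in the table above can be reproduced from the step column, since the final step (30938) corresponds to epoch 1.0. The sketch below shows that arithmetic and a rough estimate of the number of training samples seen. The sample estimate assumes a single device and no gradient accumulation, neither of which is stated in this card; the names are illustrative.

```python
total_steps = 30938      # final step in the table (epoch 1.0)
train_batch_size = 8     # from the training hyperparameters

def epoch_at(step):
    """Fraction of the single training epoch completed at a given step."""
    return step / total_steps

for step in (1000, 9000, 23000):
    print(step, round(epoch_at(step), 4))

# Assumption: one device and no gradient accumulation, so each step
# consumes exactly train_batch_size samples.
approx_train_samples = total_steps * train_batch_size
print(approx_train_samples)
```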

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0
