distily_TinyStories-33M

This student model was distilled from the teacher model roneneldan/TinyStories-33M; the training dataset is unspecified.

The Distily library was used for this distillation.
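
The student is a standard causal language model, so it can be loaded with the plain transformers API. The following is a minimal usage sketch, not part of the original card: the repository id (distily/distily_TinyStories-33M_hs_attn, taken from the model tree) and the prompt are assumptions.

```python
# Minimal usage sketch; the repo id below is assumed, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distily/distily_TinyStories-33M_hs_attn"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```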

It achieves the following results on the evaluation set (a sketch of the perplexity computation follows this list):

  • eval_enwikippl: 5505.2720
  • eval_frwikippl: 21773.6699
  • eval_zhwikippl: 149216.0938
  • eval_loss: 1.1383
  • eval_runtime: 51.1413
  • eval_samples_per_second: 48.884
  • eval_steps_per_second: 6.12
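
The eval_*ppl metrics are perplexities, presumably measured on English, French, and Chinese Wikipedia samples; the card does not specify the exact corpora or windowing. Under those assumptions, a minimal sketch of a token-level perplexity computation for a transformers causal LM:

```python
# Hedged sketch of a perplexity metric such as eval_enwikippl; the actual
# corpora, windowing, and aggregation used by Distily may differ.
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cpu", max_length=512):
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_length).to(device)
        out = model(**enc, labels=enc["input_ids"])
        n_targets = enc["input_ids"].size(1) - 1  # labels are shifted by one
        total_nll += out.loss.item() * n_targets  # loss is the mean NLL
        total_tokens += n_targets
    return math.exp(total_nll / total_tokens)
```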

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a code sketch of the combined distillation objective follows this list):

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=5000.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=500.0, loss_fn=jsd, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
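
The sketch below illustrates what the distillation_objective above computes: KL divergence on logits (weight 1), MSE on hidden states (weight 5000.0), and Jensen-Shannon divergence on attentions (weight 500.0), with identity layer mapping and no projector. It is an illustration, not Distily's actual implementation; it assumes both models expose matching layer counts and are run with output_hidden_states=True and output_attentions=True, and that teacher outputs are computed under torch.no_grad().

```python
# Illustrative sketch of the configured objective, not Distily's actual code.
# Assumes student_out / teacher_out carry .logits, .hidden_states, .attentions.
import torch
import torch.nn.functional as F

def jsd(p, q, eps=1e-8):
    # Jensen-Shannon divergence between probability distributions p and q
    # (attention weights are already softmax-normalized).
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div((m + eps).log(), p, reduction="batchmean")
                  + F.kl_div((m + eps).log(), q, reduction="batchmean"))

def distillation_loss(student_out, teacher_out):
    # logits_loss_component: KL(teacher || student) over the vocabulary, weight 1
    logits_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # hs_loss_component: MSE on hidden states, weight 5000.0,
    # identity layer mapping (layer_mapper=None), no projector
    hs_loss = sum(F.mse_loss(s, t) for s, t in
                  zip(student_out.hidden_states, teacher_out.hidden_states))
    # attn_loss_component: JSD on attention distributions, weight 500.0
    attn_loss = sum(jsd(s, t) for s, t in
                    zip(student_out.attentions, teacher_out.attentions))
    return 1.0 * logits_loss + 5000.0 * hs_loss + 500.0 * attn_loss
```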

Resource Usage

Peak GPU Memory: 8.2949 GB
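
Peak GPU memory of this kind is commonly read from PyTorch's CUDA allocator statistics; a minimal sketch (not necessarily how Distily records it):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training / evaluation loop here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.4f} GB")
```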

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 20633.1680 | 131577.2812 | | | | | 7615.4468 |
| 0 | 0 | 57409.7656 | 57878.0820 | 11.7972 | 40.6672 | 61.475 | 7.697 | 56928.0781 |
| 1000 | 0.0323 | 10372.9512 | 76930.4531 | 1.9053 | 41.7953 | 59.815 | 7.489 | 858113.625 |
| 2000 | 0.0646 | 8020.6040 | 46711.9688 | 1.6472 | 41.0642 | 60.88 | 7.622 | 367518.3125 |
| 3000 | 0.0970 | 8157.5376 | 45240.3945 | 1.5278 | 45.4508 | 55.005 | 6.887 | 515510.5625 |
| 4000 | 0.1293 | 7411.5596 | 36822.6484 | 1.4337 | 51.1158 | 48.909 | 6.123 | 421034.4688 |
| 5000 | 0.1616 | 6422.7583 | 28339.4023 | 1.3515 | 51.1748 | 48.852 | 6.116 | 267027.4375 |
| 6000 | 0.1939 | 6131.3276 | 24695.6113 | 1.2750 | 50.9734 | 49.045 | 6.14 | 194273.2656 |
| 7000 | 0.2263 | 5802.4341 | 23374.1562 | 1.2199 | 50.8571 | 49.157 | 6.155 | 168406.4688 |
| 8000 | 0.2586 | 5621.9170 | 21168.1855 | 1.1773 | 51.0097 | 49.01 | 6.136 | 164012.0469 |
| 9000 | 0.2909 | 5505.2720 | 21773.6699 | 1.1383 | 51.1413 | 48.884 | 6.12 | 149216.0938 |
| 10000 | 0.3232 | 5617.5493 | 21623.7461 | 1.1134 | 51.0853 | 48.938 | 6.127 | 148977.0625 |
| 11000 | 0.3555 | 5438.9810 | 21305.9277 | 1.0901 | 51.2289 | 48.801 | 6.11 | 148262.7188 |
| 12000 | 0.3879 | 5601.4360 | 22292.5059 | 1.0718 | 51.1771 | 48.85 | 6.116 | 156941.4062 |
| 13000 | 0.4202 | 5323.2368 | 21323.9785 | 1.0547 | 50.814 | 49.199 | 6.16 | 145089.7812 |
| 14000 | 0.4525 | 5399.0068 | 21468.7930 | 1.0443 | 50.9066 | 49.11 | 6.149 | 147118.75 |
| 15000 | 0.4848 | 5341.0449 | 20151.6465 | 1.0364 | 51.0013 | 49.018 | 6.137 | 134312.3438 |
| 16000 | 0.5172 | 5234.6987 | 20021.3477 | 1.0292 | 51.7235 | 48.334 | 6.051 | 136299.75 |
| 17000 | 0.5495 | 5317.8687 | 21308.9355 | 1.0156 | 54.7044 | 45.7 | 5.722 | 149495.2656 |
| 18000 | 0.5818 | 5521.5405 | 20827.6855 | 1.0137 | 41.4159 | 60.363 | 7.557 | 141984.7344 |
| 19000 | 0.6141 | 5249.7568 | 20254.2051 | 1.0055 | 42.1847 | 59.263 | 7.42 | 124202.625 |
| 20000 | 0.6465 | 5582.7598 | 21764.4727 | 0.9982 | 46.3033 | 53.992 | 6.76 | 149495.2656 |
| 21000 | 0.6788 | 5232.6621 | 20262.7637 | 0.9935 | 48.1287 | 51.944 | 6.503 | 145128.5312 |
| 22000 | 0.7111 | 5320.3491 | 21332.9902 | 0.9854 | 50.6681 | 49.341 | 6.177 | 155605.7656 |
| 23000 | 0.7434 | 5032.2212 | 19788.3945 | 0.9876 | 50.9899 | 49.029 | 6.138 | 141417.0312 |
| 24000 | 0.7757 | 5318.2793 | 22064.2031 | 0.9832 | 50.912 | 49.104 | 6.148 | 152560.7188 |
| 25000 | 0.8081 | 5365.5708 | 21906.0957 | 0.9779 | 51.1379 | 48.887 | 6.121 | 154034.5156 |
| 26000 | 0.8404 | 5328.6157 | 22267.3691 | 0.9740 | 51.1115 | 48.913 | 6.124 | 154983.75 |
| 27000 | 0.8727 | 5565.8813 | 22663.3496 | 0.9714 | 32.781 | 76.264 | 9.548 | 152397.8594 |
| 28000 | 0.9050 | 5278.7847 | 20380.2637 | 0.9723 | 27.108 | 92.224 | 11.546 | 141190.6406 |
| 29000 | 0.9374 | 5302.2002 | 20637.6562 | 0.9657 | 30.8728 | 80.977 | 10.138 | 139914.2969 |
| 30000 | 0.9697 | 5366.4053 | 22920.4629 | 0.9633 | 27.0433 | 92.444 | 11.574 | 160202.3281 |
| 30938 | 1.0 | 5286.9868 | 20498.4277 | 0.9628 | 27.0346 | 92.474 | 11.578 | 145051.0469 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.21.0