distily_bench_gpt2_simple_objectives2

This student model is distilled from the teacher model gpt2 using an unspecified dataset.

The Distily library was used for this distillation.
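
For reference, below is a minimal sketch of loading the distilled student with the Transformers API. The repository id lapp0/distily_bench_gpt2_simple_objectives2 is assumed from this card's title; the model loads like any GPT-2-style causal LM.

```python
# Usage sketch (not part of the training code): load the distilled student
# as a standard causal LM. Repo id assumed from this model card's title.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_simple_objectives2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Distillation compresses a teacher model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```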

It achieves the following results on the evaluation set:

  • eval_enwikippl: 26585.3379
  • eval_frwikippl: 34195.0625
  • eval_zhwikippl: 50038.4062
  • eval_loss: 0.0690
  • eval_runtime: 32.5862
  • eval_samples_per_second: 61.376
  • eval_steps_per_second: 7.672
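
The eval_enwikippl, eval_frwikippl, and eval_zhwikippl metrics are perplexities on English, French, and Chinese Wikipedia text. As a rough illustration only (not Distily's evaluation loop), perplexity for a causal LM can be computed as the exponential of the mean token-level cross-entropy:

```python
# Illustrative perplexity computation: exp(mean next-token cross-entropy).
# This is a generic sketch, not the exact evaluation code used by Distily.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_simple_objectives2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Wikipedia is a free online encyclopedia."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```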

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:cosine_distance_loss()), activations_weight=0.2, activations_loss_fn=(fn:soft_mse_loss()), attentions_weight=0, attentions_loss_fn=(fn:soft_mse_loss())) (see the sketch after this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
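
The distillation_objective entry above names Distily loss functions; the snippet below is only a hedged sketch of how such a weighted multi-objective loss (cosine distance on logits plus a soft MSE on hidden activations, with the attention term weighted to zero) could be combined. The formulas are assumed and may differ from Distily's actual implementation.

```python
# Hedged sketch of a weighted multi-objective distillation loss.
# Weights follow this card (logits 1.0, activations 0.2, attentions 0.0);
# the loss formulas themselves are assumptions, not Distily's definitions.
import torch
import torch.nn.functional as F

def cosine_distance_loss(student, teacher):
    # 1 - cosine similarity, averaged over all positions (assumed formulation)
    return (1 - F.cosine_similarity(student, teacher, dim=-1)).mean()

def soft_mse_loss(student, teacher):
    # plain mean squared error between student and teacher tensors (assumed)
    return F.mse_loss(student, teacher)

def multi_objective_loss(s_logits, t_logits, s_acts, t_acts,
                         logits_weight=1.0, activations_weight=0.2):
    loss = logits_weight * cosine_distance_loss(s_logits, t_logits)
    loss = loss + activations_weight * soft_mse_loss(s_acts, t_acts)
    # attentions_weight is 0 in this run, so the attention term is omitted
    return loss
```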

Resource Usage

Peak GPU Memory: 10.3934 GB
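
For context, a peak-memory figure like the one above is typically read from PyTorch's CUDA allocator statistics; a minimal sketch (not Distily's reporting code) follows.

```python
# Sketch of obtaining a peak-GPU-memory figure with PyTorch allocator stats.
# Assumes a single CUDA device; the workload placeholder is hypothetical.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a training or evaluation step here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```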

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 56797.875 | 58468.6992 | 1.0 | 32.7055 | 61.152 | 7.644 | 59002.2891 |
| 1000 | 0.0404 | 27961.6797 | 33253.5469 | 0.0687 | 32.6144 | 61.323 | 7.665 | 48954.4688 |
| 2000 | 0.0808 | 28987.5293 | 34209.5430 | 0.0692 | 32.5898 | 61.369 | 7.671 | 49453.8398 |
| 3000 | 0.1212 | 31211.6582 | 36006.2383 | 0.0684 | 32.5399 | 61.463 | 7.683 | 56097.9492 |
| 4000 | 0.1616 | 30200.7598 | 35859.2969 | 0.0683 | 32.7214 | 61.122 | 7.64 | 55147.2305 |
| 5000 | 0.2020 | 29574.1465 | 34793.3281 | 0.0687 | 32.5565 | 61.432 | 7.679 | 52868.5352 |
| 6000 | 0.2424 | 28607.3867 | 34403.0312 | 0.0681 | 32.4239 | 61.683 | 7.71 | 52629.0312 |
| 7000 | 0.2828 | 25131.9219 | 33173.9180 | 0.0687 | 32.5743 | 61.398 | 7.675 | 52278.8164 |
| 8000 | 0.3232 | 27496.5820 | 34228.8320 | 0.0682 | 32.6299 | 61.293 | 7.662 | 52896.7773 |
| 9000 | 0.3636 | 26585.3379 | 34195.0625 | 0.0690 | 32.5862 | 61.376 | 7.672 | 50038.4062 |
| 10000 | 0.4040 | 27479.4902 | 33309.8555 | 0.0692 | 32.6173 | 61.317 | 7.665 | 53123.3086 |
| 11000 | 0.4444 | 29032.5703 | 32465.8555 | 0.0687 | 32.5477 | 61.448 | 7.681 | 54270.5625 |
| 12000 | 0.4848 | 25965.1055 | 33366.2578 | 0.0679 | 32.6589 | 61.239 | 7.655 | 56639.9258 |
| 13000 | 0.5253 | 28022.5176 | 34031.5195 | 0.0687 | 32.7256 | 61.114 | 7.639 | 55873.6484 |
| 14000 | 0.5657 | 29721.4824 | 33602.3359 | 0.0682 | 32.4915 | 61.555 | 7.694 | 54241.5859 |
| 15000 | 0.6061 | 23164.0742 | 32511.6797 | 0.0678 | 32.6017 | 61.347 | 7.668 | 52727.5469 |
| 16000 | 0.6465 | 22154.2578 | 34514.7969 | 0.0686 | 32.5602 | 61.425 | 7.678 | 55412.9883 |
| 17000 | 0.6869 | 28816.9688 | 36942.2734 | 0.0689 | 32.582 | 61.384 | 7.673 | 54677.9062 |
| 18000 | 0.7273 | 30663.9199 | 36817.4492 | 0.0686 | 32.6509 | 61.254 | 7.657 | 52769.8047 |
| 19000 | 0.7677 | 31843.2832 | 37875.9023 | 0.0682 | 32.5752 | 61.396 | 7.675 | 53094.9453 |
| 20000 | 0.8081 | 26705.3535 | 35377.125 | 0.0677 | 32.5515 | 61.441 | 7.68 | 55873.6484 |
| 21000 | 0.8485 | 31013.5449 | 35662.6172 | 0.0671 | 32.8968 | 60.796 | 7.6 | 53579.2930 |
| 22000 | 0.8889 | 31917.5293 | 33950.0234 | 0.0672 | 32.7146 | 61.135 | 7.642 | 53779.9688 |
| 23000 | 0.9293 | 30907.7520 | 34783.5430 | 0.0678 | 32.8995 | 60.791 | 7.599 | 53564.9883 |
| 24000 | 0.9697 | 30893.3711 | 34383.6484 | 0.0676 | 32.7142 | 61.135 | 7.642 | 54039.1367 |
| 24750 | 1.0 | 37617.5469 | 41219.9453 | 0.0685 | 32.7176 | 61.129 | 7.641 | 52572.8477 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0