distily_bench_gpt2_simple_objectives2

This student model is distilled from the teacher model gpt2 using an unspecified dataset.

The Distily library was used for this distillation.
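
For reference, below is a minimal sketch of loading the distilled student with the Transformers API. The repository id lapp0/distily_bench_gpt2_simple_objectives2 is assumed from this card's title; the model loads like any GPT-2-style causal LM.

```python
# Usage sketch (not part of the training code): load the distilled student
# as a standard causal LM. Repo id assumed from this model card's title.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_simple_objectives2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Distillation compresses a teacher model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```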

It achieves the following results on the evaluation set:

  • eval_enwikippl: 26585.3379
  • eval_frwikippl: 34195.0625
  • eval_zhwikippl: 50038.4062
  • eval_loss: 0.0690
  • eval_runtime: 32.5862
  • eval_samples_per_second: 61.376
  • eval_steps_per_second: 7.672
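
The eval_enwikippl, eval_frwikippl, and eval_zhwikippl metrics are perplexities on English, French, and Chinese Wikipedia text. As a rough illustration only (not Distily's evaluation loop), perplexity for a causal LM can be computed as the exponential of the mean token-level cross-entropy:

```python
# Illustrative perplexity computation: exp(mean next-token cross-entropy).
# This is a generic sketch, not the exact evaluation code used by Distily.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_simple_objectives2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Wikipedia is a free online encyclopedia."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```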

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:cosine_distance_loss()), activations_weight=0.2, activations_loss_fn=(fn:soft_mse_loss()), attentions_weight=0, attentions_loss_fn=(fn:soft_mse_loss())) (see the sketch after this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
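
The distillation_objective entry above names Distily loss functions; the snippet below is only a hedged sketch of how such a weighted multi-objective loss (cosine distance on logits plus a soft MSE on hidden activations, with the attention term weighted to zero) could be combined. The formulas are assumed and may differ from Distily's actual implementation.

```python
# Hedged sketch of a weighted multi-objective distillation loss.
# Weights follow this card (logits 1.0, activations 0.2, attentions 0.0);
# the loss formulas themselves are assumptions, not Distily's definitions.
import torch
import torch.nn.functional as F

def cosine_distance_loss(student, teacher):
    # 1 - cosine similarity, averaged over all positions (assumed formulation)
    return (1 - F.cosine_similarity(student, teacher, dim=-1)).mean()

def soft_mse_loss(student, teacher):
    # plain mean squared error between student and teacher tensors (assumed)
    return F.mse_loss(student, teacher)

def multi_objective_loss(s_logits, t_logits, s_acts, t_acts,
                         logits_weight=1.0, activations_weight=0.2):
    loss = logits_weight * cosine_distance_loss(s_logits, t_logits)
    loss = loss + activations_weight * soft_mse_loss(s_acts, t_acts)
    # attentions_weight is 0 in this run, so the attention term is omitted
    return loss
```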

Resource Usage

Peak GPU Memory: 10.3934 GB
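
For context, a peak-memory figure like the one above is typically read from PyTorch's CUDA allocator statistics; a minimal sketch (not Distily's reporting code) follows.

```python
# Sketch of obtaining a peak-GPU-memory figure with PyTorch allocator stats.
# Assumes a single CUDA device; the workload placeholder is hypothetical.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a training or evaluation step here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```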

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 56797.875 | 58468.6992 | 1.0 | 32.7055 | 61.152 | 7.644 | 59002.2891 |
| 1000 | 0.0404 | 27961.6797 | 33253.5469 | 0.0687 | 32.6144 | 61.323 | 7.665 | 48954.4688 |
| 2000 | 0.0808 | 28987.5293 | 34209.5430 | 0.0692 | 32.5898 | 61.369 | 7.671 | 49453.8398 |
| 3000 | 0.1212 | 31211.6582 | 36006.2383 | 0.0684 | 32.5399 | 61.463 | 7.683 | 56097.9492 |
| 4000 | 0.1616 | 30200.7598 | 35859.2969 | 0.0683 | 32.7214 | 61.122 | 7.64 | 55147.2305 |
| 5000 | 0.2020 | 29574.1465 | 34793.3281 | 0.0687 | 32.5565 | 61.432 | 7.679 | 52868.5352 |
| 6000 | 0.2424 | 28607.3867 | 34403.0312 | 0.0681 | 32.4239 | 61.683 | 7.71 | 52629.0312 |
| 7000 | 0.2828 | 25131.9219 | 33173.9180 | 0.0687 | 32.5743 | 61.398 | 7.675 | 52278.8164 |
| 8000 | 0.3232 | 27496.5820 | 34228.8320 | 0.0682 | 32.6299 | 61.293 | 7.662 | 52896.7773 |
| 9000 | 0.3636 | 26585.3379 | 34195.0625 | 0.0690 | 32.5862 | 61.376 | 7.672 | 50038.4062 |
| 10000 | 0.4040 | 27479.4902 | 33309.8555 | 0.0692 | 32.6173 | 61.317 | 7.665 | 53123.3086 |
| 11000 | 0.4444 | 29032.5703 | 32465.8555 | 0.0687 | 32.5477 | 61.448 | 7.681 | 54270.5625 |
| 12000 | 0.4848 | 25965.1055 | 33366.2578 | 0.0679 | 32.6589 | 61.239 | 7.655 | 56639.9258 |
| 13000 | 0.5253 | 28022.5176 | 34031.5195 | 0.0687 | 32.7256 | 61.114 | 7.639 | 55873.6484 |
| 14000 | 0.5657 | 29721.4824 | 33602.3359 | 0.0682 | 32.4915 | 61.555 | 7.694 | 54241.5859 |
| 15000 | 0.6061 | 23164.0742 | 32511.6797 | 0.0678 | 32.6017 | 61.347 | 7.668 | 52727.5469 |
| 16000 | 0.6465 | 22154.2578 | 34514.7969 | 0.0686 | 32.5602 | 61.425 | 7.678 | 55412.9883 |
| 17000 | 0.6869 | 28816.9688 | 36942.2734 | 0.0689 | 32.582 | 61.384 | 7.673 | 54677.9062 |
| 18000 | 0.7273 | 30663.9199 | 36817.4492 | 0.0686 | 32.6509 | 61.254 | 7.657 | 52769.8047 |
| 19000 | 0.7677 | 31843.2832 | 37875.9023 | 0.0682 | 32.5752 | 61.396 | 7.675 | 53094.9453 |
| 20000 | 0.8081 | 26705.3535 | 35377.125 | 0.0677 | 32.5515 | 61.441 | 7.68 | 55873.6484 |
| 21000 | 0.8485 | 31013.5449 | 35662.6172 | 0.0671 | 32.8968 | 60.796 | 7.6 | 53579.2930 |
| 22000 | 0.8889 | 31917.5293 | 33950.0234 | 0.0672 | 32.7146 | 61.135 | 7.642 | 53779.9688 |
| 23000 | 0.9293 | 30907.7520 | 34783.5430 | 0.0678 | 32.8995 | 60.791 | 7.599 | 53564.9883 |
| 24000 | 0.9697 | 30893.3711 | 34383.6484 | 0.0676 | 32.7142 | 61.135 | 7.642 | 54039.1367 |
| 24750 | 1.0 | 37617.5469 | 41219.9453 | 0.0685 | 32.7176 | 61.129 | 7.641 | 52572.8477 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0