distily_bench_obj_cross_v2.2

This student model (68.5M parameters, BF16) was distilled from the teacher model roneneldan/TinyStories-33M; the distillation dataset is unspecified.

The Distily library was used for this distillation.
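The snippet below is a minimal usage sketch, assuming the checkpoint loads with the standard Transformers causal-LM classes; the repo id is taken from this card, and the prompt and generation settings are illustrative.

```python
# Minimal usage sketch; assumes the checkpoint loads via the standard
# Transformers causal-LM classes. Prompt and generation settings are
# illustrative, not part of this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_obj_cross_v2.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```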

It achieves the following results on the evaluation set (a sketch of the perplexity computation follows the list):

  • eval_enwikippl: 28257.9004
  • eval_frwikippl: 63896.6680
  • eval_zhwikippl: 90059.6875
  • eval_tinystoriesppl: 18426.4922
  • eval_loss: 6.6740
  • eval_runtime: 13.137
  • eval_samples_per_second: 76.121
  • eval_steps_per_second: 9.515
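Each `*ppl` metric above is a perplexity on the named dataset. For reference, the sketch below shows the standard perplexity computation, the exponential of the mean token-level cross-entropy; Distily's actual evaluation code may tokenize and aggregate differently.

```python
# Standard perplexity sketch: exp of mean token-level cross-entropy.
# Distily's own evaluation may differ (e.g. tokenization, aggregation).
import math
import torch

def perplexity(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # For causal LMs, passing labels yields the mean token NLL as .loss
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())
```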

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)); a sketch of this logits-only KL term appears after this list
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
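The distillation_objective above places all weight on a KL loss over the logits, with the hidden-state and attention components disabled (weight 0). The sketch below shows one common PyTorch implementation of such a logits-only KL term; the function name, KL direction, and reduction are assumptions, not Distily's actual API.

```python
# Sketch of a logits-only KL distillation term (weight 1 on logits,
# 0 on hidden states and attentions, per the objective above).
# Names, KL direction, and reduction are assumptions, not Distily's API.
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    # kl_div expects the input (student) as log-probabilities.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(teacher || student), averaged over the leading (batch) dimension.
    return F.kl_div(student_logp, teacher_logp,
                    log_target=True, reduction="batchmean")
```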

Resource Usage

Peak GPU Memory: 8.0568 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples/s | steps/s | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 35507.3906 | 70936.2969 | 6.875 | 13.2774 | 75.316 | 9.414 | 24370.3125 | 92840.9844 |
| 500 | 0.0404 | 28284.1875 | 63896.6680 | 6.6737 | 13.1884 | 75.824 | 9.478 | 18447.8379 | 90059.6875 |
| 1000 | 0.0808 | 28284.1875 | 63896.6680 | 6.6740 | 13.221 | 75.637 | 9.455 | 18444.7754 | 90059.6875 |
| 1500 | 0.1212 | 28284.1875 | 63896.6680 | 6.6740 | 13.1643 | 75.963 | 9.495 | 18444.7754 | 90059.6875 |
| 2000 | 0.1616 | 28284.1875 | 63896.6680 | 6.6740 | 13.2331 | 75.568 | 9.446 | 18438.6914 | 90059.6875 |
| 2500 | 0.2020 | 28284.1875 | 63896.6680 | 6.6740 | 13.1865 | 75.835 | 9.479 | 18432.5898 | 90059.6875 |
| 3000 | 0.2424 | 28257.9004 | 63896.6680 | 6.6740 | 13.246 | 75.494 | 9.437 | 18426.4922 | 90059.6875 |
| 3500 | 0.2828 | 28257.9004 | 63896.6680 | 6.6740 | 13.1762 | 75.895 | 9.487 | 18426.4922 | 90059.6875 |
| 4000 | 0.3232 | 28257.9004 | 63896.6680 | 6.6740 | 13.3585 | 74.859 | 9.357 | 18426.4922 | 90059.6875 |
| 4500 | 0.3636 | 28257.9004 | 63896.6680 | 6.6740 | 13.1842 | 75.848 | 9.481 | 18426.4922 | 90059.6875 |
| 5000 | 0.4040 | 28257.9004 | 63896.6680 | 6.6740 | 13.2694 | 75.361 | 9.42 | 18426.4922 | 90059.6875 |
| 5500 | 0.4444 | 28257.9004 | 63896.6680 | 6.6740 | 13.2102 | 75.699 | 9.462 | 18426.4922 | 90059.6875 |
| 6000 | 0.4848 | 28257.9004 | 63896.6680 | 6.6740 | 13.3012 | 75.181 | 9.398 | 18426.4922 | 90059.6875 |
| 6500 | 0.5253 | 28257.9004 | 63896.6680 | 6.6740 | 13.1704 | 75.928 | 9.491 | 18426.4922 | 90059.6875 |
| 7000 | 0.5657 | 28257.9004 | 63896.6680 | 6.6740 | 13.2236 | 75.622 | 9.453 | 18426.4922 | 90059.6875 |
| 7500 | 0.6061 | 28257.9004 | 63896.6680 | 6.6740 | 13.2333 | 75.567 | 9.446 | 18426.4922 | 90059.6875 |
| 8000 | 0.6465 | 28257.9004 | 63896.6680 | 6.6740 | 13.1385 | 76.112 | 9.514 | 18426.4922 | 90059.6875 |
| 8500 | 0.6869 | 28257.9004 | 63896.6680 | 6.6740 | 13.2297 | 75.588 | 9.448 | 18426.4922 | 90059.6875 |
| 9000 | 0.7273 | 28257.9004 | 63896.6680 | 6.6740 | 13.1073 | 76.293 | 9.537 | 18426.4922 | 90059.6875 |
| 9500 | 0.7677 | 28257.9004 | 63896.6680 | 6.6740 | 13.137 | 76.121 | 9.515 | 18426.4922 | 90059.6875 |
| 10000 | 0.8081 | 28257.9004 | 63896.6680 | 6.6740 | 13.0862 | 76.417 | 9.552 | 18426.4922 | 90059.6875 |
| 10500 | 0.8485 | 28257.9004 | 63896.6680 | 6.6740 | 13.17 | 75.93 | 9.491 | 18426.4922 | 90059.6875 |
| 11000 | 0.8889 | 28257.9004 | 63896.6680 | 6.6740 | 13.211 | 75.694 | 9.462 | 18426.4922 | 90059.6875 |
| 11500 | 0.9293 | 28257.9004 | 63896.6680 | 6.6740 | 13.1171 | 76.237 | 9.53 | 18426.4922 | 90059.6875 |
| 12000 | 0.9697 | 28257.9004 | 63896.6680 | 6.6740 | 13.2484 | 75.481 | 9.435 | 18426.4922 | 90059.6875 |
| 12375 | 1.0 | 28257.9004 | 63896.6680 | 6.6740 | 13.2116 | 75.691 | 9.461 | 18426.4922 | 90059.6875 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.20.0