lapp0 committed · Commit dcea4ea · verified · 1 Parent(s): e147beb

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
  It achieves the following results on the evaluation set:
- - eval_enwikippl: 4891.7905
- - eval_frwikippl: 35673.2305
- - eval_zhwikippl: 32045.9043
- - eval_tinystoriesppl: 1523.1017
- - eval_loss: 4.8703
- - eval_runtime: 6.5675
- - eval_samples_per_second: 76.132
- - eval_steps_per_second: 9.593
+ - eval_enwikippl: 181.7250
+ - eval_frwikippl: 74363.8594
+ - eval_zhwikippl: 2022567.625
+ - eval_tinystoriesppl: 9.8853
+ - eval_loss: 1.1866
+ - eval_runtime: 6.5139
+ - eval_samples_per_second: 76.759
+ - eval_steps_per_second: 9.672
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
@@ -47,12 +47,10 @@ More information needed
  The following hyperparameters were used during training:
  - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=0, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  - train_embeddings: True
- - learning_rate: 0.0004
+ - learning_rate: 0.004
  - train_batch_size: 8
  - eval_batch_size: 8
  - seed: 42
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 64
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: constant_with_warmup
  - num_epochs: 1.0
@@ -64,9 +62,20 @@ Peak GPU Memory: 8.0568 GB
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 28486.4824 | 70517.9062 | 6.3282 | 6.5373 | 76.484 | 9.637 | 14258.5928 | 39903.9922 |
- | 500 | 0.6464 | 4952.0312 | 35693.3398 | 4.8720 | 6.4924 | 77.014 | 9.704 | 1551.0551 | 32286.1875 |
- | 773 | 0.9994 | 4891.7905 | 35673.2305 | 4.8703 | 6.5675 | 76.132 | 9.593 | 1523.1017 | 32045.9043 |
+ | 0 | 0 | 28486.4824 | 70517.9062 | 6.3282 | 6.4967 | 76.962 | 9.697 | 14258.5928 | 39903.9922 |
+ | 500 | 0.0808 | 273.2714 | 228192.9688 | 1.3967 | 6.4879 | 77.066 | 9.71 | 11.1741 | 4322949.5 |
+ | 1000 | 0.1616 | 243.0954 | 144923.9688 | 1.2256 | 6.4896 | 77.046 | 9.708 | 10.8133 | 7570136.5 |
+ | 1500 | 0.2424 | 203.6588 | 90912.9609 | 1.1940 | 6.4809 | 77.149 | 9.721 | 10.8402 | 2947090.25 |
+ | 2000 | 0.3232 | 200.2716 | 83404.1484 | 1.1897 | 6.4988 | 76.937 | 9.694 | 10.7438 | 2592153.75 |
+ | 2500 | 0.4040 | 184.8414 | 76467.0 | 1.1874 | 6.4917 | 77.021 | 9.705 | 9.9579 | 2185851.5 |
+ | 3000 | 0.4848 | 181.0154 | 75556.9688 | 1.1875 | 6.4856 | 77.094 | 9.714 | 9.7151 | 2026889.125 |
+ | 3500 | 0.5656 | 180.5813 | 75770.125 | 1.1868 | 6.4769 | 77.198 | 9.727 | 9.7699 | 2169583.75 |
+ | 4000 | 0.6464 | 183.1808 | 76985.7891 | 1.1867 | 6.4852 | 77.099 | 9.714 | 9.8759 | 2138550.5 |
+ | 4500 | 0.7272 | 181.8940 | 75908.9922 | 1.1866 | 6.4932 | 77.004 | 9.703 | 9.7946 | 2117543.5 |
+ | 5000 | 0.8080 | 182.3313 | 75185.3516 | 1.1865 | 6.4852 | 77.098 | 9.714 | 9.8894 | 2145408.0 |
+ | 5500 | 0.8888 | 181.7320 | 76262.6484 | 1.1871 | 6.4902 | 77.039 | 9.707 | 9.6759 | 2123201.5 |
+ | 6000 | 0.9696 | 183.1027 | 76574.7891 | 1.1868 | 6.488 | 77.066 | 9.71 | 9.8169 | 2035559.625 |
+ | 6188 | 1.0 | 181.7250 | 74363.8594 | 1.1866 | 6.5139 | 76.759 | 9.672 | 9.8853 | 2022567.625 |
 
  ### Framework versions
  - Distily 0.2.0
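
The distillation_objective above applies only a KL-divergence loss between student and teacher logits; the hidden-state and attention components both have weight 0. As a rough illustration of what such an objective computes (a minimal sketch with hypothetical names, not Distily's actual implementation), the logits term could look like this in PyTorch:

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over next-token distributions (hypothetical sketch).

    Mirrors a weight-1 `kl` logits loss component; the hidden-state and
    attention components have weight 0 in this card, so they are omitted.
    """
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean averages the divergence over the leading (batch) dimension.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage with random logits shaped (batch, seq_len, vocab_size).
student_logits = torch.randn(2, 16, 50257, requires_grad=True)
teacher_logits = torch.randn(2, 16, 50257)
loss = kl_logits_loss(student_logits, teacher_logits)
print(loss.item())
```

With `train_embeddings: True`, gradients from this loss also update the student's embedding weights during training.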
logs/attn_loss_fn=None, attn_weight=0, gradient_accumulation_steps=1, hs_loss_fn=0, hs_weight=0, learning_rate=0.004, lr_scheduler_type=constant_with_warmup, max_grad_norm=1.0, num_warmup_steps=0, optim=pa/events.out.tfevents.1723843837.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9ff6dafb128d8e55f05251b761af97a48989ac5ec545d7edc84057fade63741
+ size 307
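
The added `events.out.tfevents.*` file is stored as a Git LFS pointer (the 307-byte stub above), so the actual TensorBoard log must be pulled with LFS before its scalars can be read. One way to inspect it locally, assuming the `tensorboard` package is installed and using a hypothetical local path in place of the long hyperparameter-named directory under `logs/`:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Hypothetical local path to the downloaded event file.
event_file = "logs/events.out.tfevents.1723843837.5f530b1cf724"

acc = EventAccumulator(event_file)
acc.Reload()                              # parse the event file from disk
print(acc.Tags()["scalars"])              # list the logged scalar tags
for scalar in acc.Scalars(acc.Tags()["scalars"][0]):
    print(scalar.step, scalar.value)      # step number and logged value
```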