lapp0 committed
Commit 2530bf6 · verified · parent: 8077334

Training in progress, step 6188
README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 3588.0862
- - eval_frwikippl: 29491.5098
- - eval_zhwikippl: 52398.3594
- - eval_tinystoriesppl: 1160.5111
- - eval_loss: 5.1062
- - eval_runtime: 6.5853
- - eval_samples_per_second: 75.926
- - eval_steps_per_second: 9.567
+ - eval_enwikippl: 7370.6421
+ - eval_frwikippl: 36625.3633
+ - eval_zhwikippl: 69136.0938
+ - eval_tinystoriesppl: 3403.9065
+ - eval_loss: 5.1768
+ - eval_runtime: 6.4845
+ - eval_samples_per_second: 77.107
+ - eval_steps_per_second: 9.715
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -56,29 +56,29 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
- Peak GPU Memory: 8.0568 GB
+ Peak GPU Memory: 8.0557 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 21321.3555 | 56774.5312 | 6.6010 | 6.5827 | 75.956 | 9.57 | 11289.9248 | 60744.7383 |
- | 500 | 0.0808 | 3639.8804 | 29520.6055 | 5.1093 | 6.602 | 75.735 | 9.543 | 1181.7124 | 52932.3008 |
- | 1000 | 0.1616 | 3605.0818 | 29512.3027 | 5.1083 | 6.5998 | 75.76 | 9.546 | 1168.1150 | 52622.5078 |
- | 1500 | 0.2424 | 3596.4351 | 29524.7734 | 5.1073 | 6.6103 | 75.64 | 9.531 | 1163.0084 | 52566.3789 |
- | 2000 | 0.3232 | 3585.8628 | 29491.5098 | 5.1077 | 6.6062 | 75.686 | 9.536 | 1158.5942 | 52426.3516 |
- | 2500 | 0.4040 | 3586.9744 | 29491.5098 | 5.1077 | 6.6186 | 75.544 | 9.519 | 1159.1688 | 52426.3516 |
- | 3000 | 0.4848 | 3585.8628 | 29491.5098 | 5.1077 | 6.5957 | 75.807 | 9.552 | 1158.2108 | 52398.3594 |
- | 3500 | 0.5656 | 3585.8628 | 29491.5098 | 5.1077 | 6.6105 | 75.638 | 9.53 | 1158.7859 | 52398.3594 |
- | 4000 | 0.6464 | 3585.8628 | 29491.5098 | 5.1077 | 6.6047 | 75.704 | 9.539 | 1158.5942 | 52398.3594 |
- | 4500 | 0.7272 | 3586.9744 | 29491.5098 | 5.1077 | 6.6182 | 75.55 | 9.519 | 1158.9771 | 52398.3594 |
- | 5000 | 0.8080 | 3585.8628 | 29491.5098 | 5.1077 | 6.594 | 75.827 | 9.554 | 1158.5942 | 52398.3594 |
- | 5500 | 0.8888 | 3588.0862 | 29508.1367 | 5.1068 | 6.5974 | 75.787 | 9.549 | 1159.9358 | 52398.3594 |
- | 6000 | 0.9696 | 3588.0862 | 29491.5098 | 5.1062 | 6.5958 | 75.805 | 9.551 | 1160.1277 | 52398.3594 |
- | 6188 | 1.0 | 3588.0862 | 29491.5098 | 5.1062 | 6.5853 | 75.926 | 9.567 | 1160.5111 | 52398.3594 |
+ | 0 | 0 | 43423.2812 | 70766.6328 | 6.6982 | 6.5086 | 76.821 | 9.679 | 33276.4844 | 75720.9297 |
+ | 500 | 0.0808 | 7652.8320 | 36775.2695 | 5.1768 | 6.4848 | 77.103 | 9.715 | 3608.4768 | 70684.1406 |
+ | 1000 | 0.1616 | 7409.5664 | 36723.5039 | 5.1768 | 6.491 | 77.03 | 9.706 | 3437.2712 | 69450.3828 |
+ | 1500 | 0.2424 | 7313.7646 | 36645.9766 | 5.1778 | 6.4918 | 77.02 | 9.705 | 3335.3831 | 68878.3828 |
+ | 2000 | 0.3232 | 7313.7646 | 36645.9766 | 5.1778 | 6.4851 | 77.099 | 9.715 | 3339.7979 | 68841.6016 |
+ | 2500 | 0.4040 | 7354.6680 | 36625.3633 | 5.1772 | 6.49 | 77.042 | 9.707 | 3393.2302 | 69062.3516 |
+ | 3000 | 0.4848 | 7388.9336 | 36656.3242 | 5.1762 | 6.5016 | 76.905 | 9.69 | 3415.7463 | 69173.0234 |
+ | 3500 | 0.5656 | 7393.5151 | 36676.9883 | 5.1762 | 6.4831 | 77.123 | 9.718 | 3418.0046 | 69173.0234 |
+ | 4000 | 0.6464 | 7359.2285 | 36645.9766 | 5.1772 | 6.4881 | 77.064 | 9.71 | 3393.2302 | 69062.3516 |
+ | 4500 | 0.7272 | 7320.5684 | 36645.9766 | 5.1772 | 6.486 | 77.089 | 9.713 | 3356.4048 | 69025.5469 |
+ | 5000 | 0.8080 | 7320.5684 | 36645.9766 | 5.1772 | 6.6011 | 75.745 | 9.544 | 3351.9680 | 68988.6953 |
+ | 5500 | 0.8888 | 7327.3711 | 36625.3633 | 5.1778 | 6.5132 | 76.767 | 9.673 | 3361.9597 | 69025.5469 |
+ | 6000 | 0.9696 | 7384.3545 | 36625.3633 | 5.1762 | 6.4883 | 77.062 | 9.71 | 3409.5400 | 69173.0234 |
+ | 6188 | 1.0 | 7370.6421 | 36625.3633 | 5.1768 | 6.4845 | 77.107 | 9.715 | 3403.9065 | 69136.0938 |
 
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
- - Datasets 2.20.0
+ - Datasets 2.21.0
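The `*ppl` columns in the README diff above are perplexities, which for model cards like this are typically the exponential of the mean per-token cross-entropy on the given corpus. A minimal sketch of that relationship (the helper name is ours, not part of Distily):

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# The final eval_loss of 5.1768 corresponds to a perplexity of roughly
# 177 on the eval split as a whole; the per-corpus columns (enwikippl,
# frwikippl, zhwikippl, tinystoriesppl) are the same quantity measured
# on each corpus separately, which is why they differ so widely.
print(round(perplexity(5.1768), 1))
```

This is also why the loss barely moving after step 500 shows up as near-constant perplexity columns in the table.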
logs/attn_loss_fn=None, attn_weight=0, gradient_accumulation_steps=1, hs_loss_fn=0, hs_weight=0, learning_rate=0.0004, lr_scheduler_type=constant_with_warmup, max_grad_norm=1.0, num_warmup_steps=0, optim=p/events.out.tfevents.1723839050.5f530b1cf724 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:f31f94600fbce10692afe98d1d981038b3e5647d97e7b7a9d771b06cee49dbed
- size 307
+ oid sha256:701c98471a1b7101f30262a592c7e32461b08db16eff1d905b1cc67268ed24f7
+ size 578
logs/attn_loss_fn=None, attn_weight=0, gradient_accumulation_steps=1, hs_loss_fn=0, hs_weight=0, learning_rate=0.0004, lr_scheduler_type=constant_with_warmup, max_grad_norm=1.0, num_warmup_steps=1000, opti/events.out.tfevents.1723839267.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbb63e796ae75f55df680ff655ee7d6a41b4c0b18ddcbf275072c65549cbcc81
+ size 2932929
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:b4cd16d7352f58b542ae2b31e986e20aa1ad58363876d0fdb552464f26eff300
+ oid sha256:704365b2903f4f9092aa0d2bac61b7186825189c62f717760b29a73900327a4a
 size 137033984
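The `version`/`oid`/`size` triplets in these file entries are Git LFS pointer files: the commit swaps the pointer (and thus the sha256) while the weight file stays at the same byte size. A minimal, self-contained sketch of parsing such a pointer, using the new model.safetensors pointer from this commit:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # `size` is the byte size of the real payload, not of the pointer itself
    fields["size"] = int(fields["size"])
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:704365b2903f4f9092aa0d2bac61b7186825189c62f717760b29a73900327a4a
size 137033984
"""
fields = parse_lfs_pointer(pointer)
print(fields["size"])  # 137033984 bytes, unchanged across the commit
```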
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:ae77ed9d5da881c82ec366e8d74e46f1a9fe6f68c6877f4450a9c37640920326
+ oid sha256:e27dca91af41039e47b1d6a0fb0b33c27148d47b49cfa77503c16cc0d5db6bdb
 size 1017948232