Update README.md
README.md CHANGED
@@ -65,16 +65,15 @@ This repo contains the following configurations under `./models/`:
   + Addendum: this replaced the `ar+nar-llama-8` as the de facto model (taking its name), so the above does apply.
 
 * `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
+  + Post-trained on a small English subset of Emilia, a small private corpus, and the Japanese+French+German portions of Emilia.
   + Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR.
   + This model received LayerSkip-aware training, with layer dropout and an early-exit loss to help bolster the model and enable self-speculation sampling.
   + I *need* to do heavy evaluation against the base model to ensure output quality does not drop before considering replacing the base model with this.
-  + It currently does not seem to perform better even without early-exit...
   + The goal is to utilize self-speculation sampling to enable speedups when possible.
   + The current implementation will early-exit if the entropy/varentropy of the logits are low enough.
+  + Training is a pain.
+  + LayerSkip-aware training does *not* like to train under ROCm.
+  + Training under float16+AMP with loss scaling will fry the model with a large enough de facto batch size (>512 samples/update step) and/or too low a loss scale (<=8K).
 
 Some additional configurations have been explored, but experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
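For context on the "LayerSkip-aware training" bullet above, here is a minimal sketch of what layer dropout plus an early-exit loss can look like in PyTorch. The module names, the shared `head`, and the loss weighting are illustrative assumptions, not this repo's actual implementation:

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def layerskip_forward(layers: nn.ModuleList,
                      head: nn.Linear,
                      x: torch.Tensor,         # [batch, seq, dim] hidden states
                      targets: torch.Tensor,   # [batch, seq] target token ids
                      max_dropout_p: float = 0.1,
                      early_exit_weight: float = 0.1):
    """Forward pass with depth-scaled layer dropout and per-layer early-exit losses.

    Later layers are dropped more often, and every surviving layer's hidden state
    is pushed through the shared output head so intermediate exits learn to
    produce usable logits (which is what makes self-speculation possible later).
    """
    num_layers = len(layers)
    total_loss = x.new_zeros(())
    logits = None
    for i, layer in enumerate(layers):
        is_last = (i == num_layers - 1)
        # Depth-scaled layer dropout; never drop the final layer.
        dropout_p = max_dropout_p * (i + 1) / num_layers
        if layers.training and not is_last and random.random() < dropout_p:
            continue
        x = layer(x)
        # Early-exit loss: intermediate logits through the shared head,
        # down-weighted relative to the final layer's loss.
        logits = head(x)
        weight = 1.0 if is_last else early_exit_weight
        total_loss = total_loss + weight * F.cross_entropy(
            logits.transpose(1, 2), targets)
    return logits, total_loss
```

The LayerSkip recipe proper also schedules the dropout rates and loss weights over the course of training; this sketch omits that curriculum.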
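The "early-exit if the entropy/varentropy of the logits are low enough" criterion could look roughly like the following; the thresholds and function name are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def should_early_exit(logits: torch.Tensor,
                      entropy_threshold: float = 0.1,
                      varentropy_threshold: float = 0.1) -> bool:
    """Decide whether to stop at the current layer based on how confident
    the intermediate logits are. Thresholds here are illustrative, not tuned."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of the predictive distribution.
    entropy = -(probs * log_probs).sum(dim=-1)
    # Varentropy: variance of the per-token information content (-log p).
    varentropy = (probs * (log_probs + entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    return bool((entropy < entropy_threshold).all() and
                (varentropy < varentropy_threshold).all())
```

At inference time this check would run on the logits produced at an intermediate layer; if it passes, the remaining layers are skipped for that decoding step.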
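On the float16+AMP footnote: with PyTorch's stock `GradScaler`, one defensive pattern (a sketch, assuming a `model` that returns its own scalar loss) is to start the loss scale well above the ~8K danger zone and surface a warning if it ever decays that low:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0 ** 16,   # start well above the <=8K regime noted above
    growth_interval=1000,   # try to grow the scale back fairly often after backoffs
)

def train_step(model, optimizer, batch, targets):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch, targets)   # assumed to return a scalar loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    # If the scale collapses to <=8K, the run is in the regime the note above
    # says will fry the model; better to notice early than after the fact.
    if scaler.get_scale() <= 8192:
        print(f"warning: loss scale decayed to {scaler.get_scale()}")
    return loss.item()
```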