ecker committed on
Commit 089db71
1 Parent(s): 0966260

Update README.md

Files changed (1)
  1. README.md +5 -6
README.md CHANGED
@@ -65,16 +65,15 @@ This repo contains the following configurations under `./models/`:
   + Addendum: this replaced the `ar+nar-llama-8` as the de facto model (taking its name), so the above does apply.
 
 * `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
-  + Trained on a small English subset of Emilia and a small private corpus, and Japanese+French+German from Emilia.
+  + Post-trained on a small English subset of Emilia and a small private corpus, and Japanese+French+German from Emilia.
   + Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR.
   + This model received LayerSkip-aware training, with layer dropout and early-exit loss to help bolster the model and enable self-speculation sampling.
   + I *need* to do heavy evaluation against the base model to ensure output quality does not drop before considering replacing the base model with this.
-  + It currently does not seem to perform better even without early-exit...
   + Goal is to utilize self-speculation sampling to enable speedups when possible.
-  + Current implementation will early-exit if the entropy/varentropy of the logits are low enough.
-  + There doesn't seem to be any significant speedup...
-  + Training is a pain, as float16 + AMP will fry the model fast, and training in bfloat16 (with/without AMP) seems to harm the model overall.
-  + I'd like to think more training time will help, but it doesn't seem to be worth it for a marginal speedup.
+  + Current implementation will early-exit if the entropy/varentropy of the logits are low enough.
+  + Training is a pain.
+  + LayerSkip-aware training does *not* like to train under ROCm.
+  + Training under float16 + AMP with loss scaling will fry the model with a large enough de facto batch size (>512 samples/update step) and/or too low a loss scale (<=8K).
 
 Some additional configurations have been explored, but the experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
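
For context on the "shuffled batches (where each batch has the same durations)" bullet above, a minimal sketch of duration-bucketed batching is shown below; the bucketing granularity and the function name are illustrative assumptions, not the repo's actual sampler.

```python
# Hypothetical sketch of duration-bucketed batching; not the repo's actual sampler.
import random
from collections import defaultdict

def duration_bucketed_batches(durations, batch_size, seed=0):
    """Group sample indices so each batch holds utterances of (roughly) the same
    duration, then shuffle the batches themselves."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, dur in enumerate(durations):
        buckets[round(dur)].append(idx)  # assumption: bucket by whole seconds
    batches = []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        batches += [bucket[i:i + batch_size] for i in range(0, len(bucket), batch_size)]
    rng.shuffle(batches)
    return batches
```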
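The "LayerSkip-aware training, with layer dropout and early-exit loss" bullet roughly corresponds to the sketch below; the module names (`embed`, `layers`, `norm`, `lm_head`) and the dropout/loss weights are assumptions for illustration, not the actual implementation.

```python
# Rough sketch of LayerSkip-style training: layer dropout that grows with depth,
# plus an auxiliary early-exit loss from the shared LM head at intermediate layers.
# Module names and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def layerskip_loss(model, tokens, targets, p_skip_max=0.1, exit_loss_weight=0.1):
    h = model.embed(tokens)
    num_layers = len(model.layers)
    loss = 0.0
    for i, layer in enumerate(model.layers):
        # Layer dropout: deeper layers are skipped more often during training.
        p_skip = p_skip_max * i / max(num_layers - 1, 1)
        if model.training and torch.rand(()).item() < p_skip:
            continue
        h = layer(h)
        # Early-exit loss: reuse the shared LM head on intermediate hidden states,
        # down-weighted relative to the final layer's loss.
        logits = model.lm_head(model.norm(h))
        weight = 1.0 if i == num_layers - 1 else exit_loss_weight
        loss = loss + weight * F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return loss
```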
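The "early-exit if the entropy/varentropy of the logits are low enough" bullet can be read as a confidence test like the one below; the thresholds and the helper name are assumptions, and where exactly it hooks into decoding is up to the implementation.

```python
# Sketch of an entropy/varentropy early-exit check; thresholds are assumptions.
import torch
import torch.nn.functional as F

def should_early_exit(logits, entropy_threshold=0.1, varentropy_threshold=0.1):
    """True when the last-position logits are confident enough (low entropy and
    low varentropy) to skip the remaining layers and sample from this exit."""
    log_probs = F.log_softmax(logits[..., -1, :], dim=-1)  # last token position
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)              # H = -sum p * log p
    varentropy = (probs * (log_probs + entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    return bool((entropy < entropy_threshold).all() and (varentropy < varentropy_threshold).all())
```

In a self-speculation setup, a check like this would run at the candidate exit layer each decoding step; when it fails, the remaining layers run as usual.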
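The float16 + AMP note amounts to keeping the dynamic loss scale comfortably above 8K; a minimal PyTorch loop under that assumption follows (the `init_scale` value and the loop shape are illustrative, not the repo's config).

```python
# Minimal float16 + AMP loop with an explicit initial loss scale; illustrative only.
import torch

def train_fp16_amp(model, optimizer, dataloader, init_scale=65536.0):
    # Keep the loss scale well above 8K, per the note above; GradScaler will still
    # back off automatically if gradients overflow.
    scaler = torch.cuda.amp.GradScaler(init_scale=init_scale)
    for batch in dataloader:
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```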