ecker committed on
Commit 36c9b57 · verified · 1 Parent(s): ce4bc83

Update README.md

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -57,6 +57,10 @@ This repo contains the following configurations under `./models/`:
   + Classifier-free-guidance-aware training was also performed, really helping prompt adherence even at ar-temperature=1.0.
   + Regression tests are needed just in case I did botch something, but it seems really nice so far.
   + The old weights are saved as `ar+nar-old-llama-8` in the event of a nasty regression, but I doubt it's necessary.
+ * Addendum: this received additional NAR-len training for an inferencing mode with huge speedups.
+ * Despite the model *technically* receiving some (wrong) training for this modality, it works well enough from an existing model, albeit not with quality on par with the base AR+NAR modality.
+ * Weights will update as training progresses for NAR-len, and it may pivot to become the default modality.
+ * If all goes well, these weights will revert to the original snapshot, while the reference model will be renamed to `ar+nar-len-llama-8` instead.
 
  * ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights use the above weights but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
@@ -90,19 +94,15 @@ This repo contains the following configurations under `./models/`:
   * Despite being a failure, this does pave a nice way to shrink models from an existing model. However, it does not seem to be useful, as even dropping two or three layers really harms how well the prompt is followed.
 
  * `config.llama[nar-len].yaml` / `nar-len-llama-8`: A fully non-autoregressive model.
-  * These weights are a work in progress, but currently a good proof of concept until training is on par with the base `ar+nar-llama-8` model.
-  * A ***lot*** of pain was put into trying to get something working, from implementation issues to dumb mistakes, until the best option of just training from scratch was picked.
-  * Technically, `ar+nar-llama-8` can be modified into a pure non-autoregressive model, but I needed to start from scratch before dumping more time into trying to adapt it.
-  * Speedups are immense compared to `ar+nar-llama-8`, as the entire audio output is decoded in parallel rather than causally.
+  * These weights are mostly an experiment to ensure that a pure NAR model works (through demasking inferencing).
+  * A ***lot*** of pain was put into trying to get something working, from implementation issues to dumb mistakes.
   * Throughput and memory usage should be constant between inferencing steps.
-  * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead.
-  * Seems to absolutely require classifier-free guidance to keep the output stable.
-  * The "confidence" issue on voices it hasn't seen / hasn't seen much of is much more noticeable, as RVQ level 0 is much more susceptible to it.
+  * The model only needs to be invoked about 5+(25+7)*2 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels, doubled for CFG) instead.
+  * Seems to absolutely require classifier-free guidance >= 2.0 to keep the output stable (but this does replace the need for rep pen + low temp, even for the normal AR+NAR).
   * Unlike the base model, this is trained with the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
-  * ...except STT; this received no STT training out of fear of botching the model.
-  * Weights will be added as the model is trained.
-  * This *was* expected to be a dud, but one very, very small oversight in the sampling code proved to be the culprit...
-  * In other words, the model *does* work.
+  * This *was* slated as a dud until the final oversight was squashed in the inferencing code, but it works *almost decently* as a TTS model.
+  * The output quality itself leaves a lot to be desired.
+  * Training on this model is finalized, as dedicating training time to extending the base model for NAR-len capabilities is the better option, but these weights will remain for whom it may concern.
 
 Some additional configurations have been explored, but experiments have not been fruitful:
  * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
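As a rough illustration of the NAR-len figures quoted above (about 5 + (25 + 7) * 2 invocations: duration inferencing, RVQ level 0 demasking, and the remaining RVQ levels, doubled for classifier-free guidance, with CFG >= 2.0 needed for stable output), here is a minimal sketch of a demasking loop with CFG blending. This is not the repo's actual inferencing code: the `model(tokens, conditioning)` signature, the mask token, and the linear keep schedule are assumptions for illustration only.

```python
import torch

# Hypothetical constants mirroring the step counts quoted above.
DURATION_STEPS   = 5    # forward passes to estimate the output duration
DEMASK_STEPS     = 25   # iterative demasking passes for RVQ level 0
REMAINING_LEVELS = 7    # one pass per remaining RVQ level
CFG_SCALE        = 2.0  # the README notes >= 2.0 is needed for stable output
MASK_ID          = -1   # placeholder "masked" token id (assumption)

def cfg_logits(model, tokens, cond, null_cond, scale=CFG_SCALE):
    """Blend conditional and unconditional logits (two forward passes per step)."""
    logits_cond   = model(tokens, cond)        # assumed signature: model(tokens, conditioning)
    logits_uncond = model(tokens, null_cond)
    return logits_uncond + scale * (logits_cond - logits_uncond)

def demask_rvq_level_0(model, cond, null_cond, length):
    """Start fully masked, then keep a growing set of the most confident
    predictions each step; everything else stays masked for the next pass."""
    tokens = torch.full((1, length), MASK_ID, dtype=torch.long)
    for step in range(DEMASK_STEPS):
        logits = cfg_logits(model, tokens, cond, null_cond)   # (1, length, vocab)
        conf, ids = logits.softmax(dim=-1).max(dim=-1)        # per-position confidence
        keep = max(1, (length * (step + 1)) // DEMASK_STEPS)  # simple linear schedule
        threshold = conf.topk(keep, dim=-1).values[..., -1:]
        tokens = torch.where(conf >= threshold, ids, torch.full_like(ids, MASK_ID))
    return tokens

# The remaining RVQ levels (1..7) would each take one more CFG-blended pass.
# Rough forward-pass budget, matching the "5 + (25 + 7) * 2" figure above;
# it stays constant regardless of the audio length, unlike AR decoding.
total_passes = DURATION_STEPS + (DEMASK_STEPS + REMAINING_LEVELS) * 2
print(total_passes)  # 69
```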