Update README.md
README.md (changed)
```diff
@@ -57,6 +57,10 @@ This repo contains the following configurations under `./models/`:
   + Classifier-free-guidance-aware training was also performed, which really helps prompt adherence even at `ar-temperature=1.0`.
   + Regression tests are needed just in case I did botch something, but it seems really nice so far.
   + The old weights are saved as `ar+nar-old-llama-8` in the event of a nasty regression, but I doubt it's necessary.
+  * Addendum: this received additional NAR-len training, enabling an inferencing mode with huge speedups.
+    * Despite the model *technically* receiving some (incorrect) training for this modality, it works well enough from an existing model, albeit not with quality on par with the base AR+NAR modality.
+    * Weights will be updated as NAR-len training progresses, and it may pivot to become the default modality.
+    * If all goes well, these weights will revert to the original snapshot, while the reference model will be renamed to `ar+nar-len-llama-8` instead.
 
 * ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights build on the above, with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
```
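For context on the classifier-free-guidance-aware training mentioned in the first hunk: this generally means dropping the conditioning (text and/or audio prompt) for a small fraction of training samples, so the model also learns an unconditional distribution that guidance can contrast against at inference. A minimal sketch of that idea in PyTorch; the `model` signature, the batch field names, and the 10% drop rate are assumptions for illustration, not the repo's actual code:

```python
import torch

def cfg_aware_training_step(model, batch, optimizer, cfg_dropout_p=0.1):
    """One training step with condition dropout ("CFG-aware" training).

    Hypothetical sketch: `model` is assumed to take text/prompt conditioning
    plus targets and return a scalar loss; the real signatures differ.
    """
    text, prompt, targets = batch["text"], batch["prompt"], batch["targets"]

    # Occasionally drop the conditioning so the model also learns an
    # unconditional distribution for CFG to contrast against at inference.
    if torch.rand(()).item() < cfg_dropout_p:
        text, prompt = None, None

    loss = model(text=text, prompt=prompt, targets=targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the dropout is that the unconditional branch used by guidance at inference time has actually been trained, rather than being an out-of-distribution input.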
```diff
@@ -90,19 +94,15 @@
   * Despite being a failure, this does pave a nice way to shrink models from an existing model. However, this does not seem to be useful, as even dropping two/three layers really does harm how well the prompt is followed.
 
 * `config.llama[nar-len].yaml` / `nar-len-llama-8`: A fully non-autoregressive model.
-  * These weights are a
-  * A ***lot*** of pain was put into trying to get something working, through implementation issues to dumb mistakes
-  * Technically, the `ar+nar-llama-8` can be modified to be a pure non-autoregressive model, but I needed to start from scratch before dumping more time again trying to adapt it.
-  * Speedups are immense compared to the `ar+nar-llama-8`, as the entire audio output is decoded in parallel rather than causally.
+  * These weights are mostly an experiment to ensure that a pure NAR model works (through demasking inferencing).
+  * A ***lot*** of pain went into getting something working, from implementation issues to dumb mistakes.
   * Throughput and memory usage should be constant between inferencing steps.
-  * The model only needs to be invoked about 5+25+7 (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead.
-  * Seems to absolutely require classifier-free-guidance to keep the output stable.
-  * The "confidence" issue on voices it hasn't seen / hasn't seen much of is much more noticeable as RVQ level 0 is much more susceptable to it.
+  * The model only needs to be invoked about 5+(25+7)*2 = 69 times (duration inferencing + RVQ level 0 inferencing + the remaining RVQ levels, doubled for CFG) instead.
+  * Seems to absolutely require classifier-free guidance >= 2.0 to keep the output stable (but this replaces the need for repetition penalty + low temperature, even for the normal AR+NAR).
   * Unlike the base model, this is trained with the current dataset without iteratively drip-feeding additional sources (like tacking on Emilia afterwards).
-
-
-
-  * In other words, the model *does* work.
+  * This *was* slated to be a dud until the final oversight was squashed in the inferencing code, but it works *almost decently* as a TTS model.
+  * The output quality itself leaves a lot to be desired.
+  * Training for this model is finalized, as dedicating training time to extending the base model with NAR-len capabilities is the better use of it, but these weights will remain for whom it may concern.
 
 Some additional configurations have been explored, but experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
```
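The "demasking inferencing" behind the fully non-autoregressive `nar-len-llama-8` is essentially masked parallel decoding: start from a fully masked RVQ level 0 sequence of the predicted duration, then over a fixed number of steps (the ~25 counted above) commit the most confident predictions and re-mask the rest. A rough sketch under assumed names, shapes, and schedule; the actual model call and sampling details in the repo will differ:

```python
import math
import torch

@torch.no_grad()
def demask_rvq0(model, text, prompt, duration, steps=25, mask_id=1024):
    """Hypothetical sketch of demasking inference for RVQ level 0.

    Starts fully masked and, over `steps` iterations, keeps the most confident
    predictions while re-masking the rest, so the sequence is decoded in
    parallel rather than token-by-token.
    """
    tokens = torch.full((duration,), mask_id, dtype=torch.long)

    for step in range(steps):
        # One forward pass over the whole (partially masked) sequence.
        logits = model(text=text, prompt=prompt, resp=tokens, level=0)  # (duration, vocab)
        confidence, candidates = logits.softmax(dim=-1).max(dim=-1)

        # Unmask progressively more positions each step (cosine-ish schedule).
        if step == steps - 1:
            num_keep = duration  # final step: commit everything
        else:
            ratio = 1.0 - math.cos(math.pi / 2 * (step + 1) / steps)
            num_keep = max(1, int(ratio * duration))

        keep = confidence.topk(num_keep).indices
        new_tokens = torch.full_like(tokens, mask_id)
        new_tokens[keep] = candidates[keep]
        tokens = new_tokens

    return tokens
```

The remaining RVQ levels are then filled in one pass per level (the `+7` above), and the duration itself comes from a handful of separate inferencing steps (the `5`).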
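On the classifier-free-guidance >= 2.0 requirement: at inference this means every decoding step runs two forward passes, one with the conditioning and one without, and blends the logits, which is why the invocation count above is doubled (5 + (25 + 7) * 2 = 69). A minimal sketch, again with an assumed `model` signature:

```python
import torch

@torch.no_grad()
def guided_logits(model, text, prompt, resp, cfg_scale=2.0):
    """Classifier-free guidance: blend conditional and unconditional logits."""
    cond = model(text=text, prompt=prompt, resp=resp)   # conditioned pass
    uncond = model(text=None, prompt=None, resp=resp)   # unconditioned pass
    # cfg_scale=1.0 reduces to the plain conditional logits; the README notes
    # that >= 2.0 is needed to keep the NAR-len output stable.
    return uncond + cfg_scale * (cond - uncond)
```

Pushing probability mass toward the conditioned prediction like this lines up with the note that it replaces the need for repetition penalty and low temperature.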
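Lastly, the 3:1 `tts`:`stt` ratio mentioned for the `ar+nar-tts+stt-llama-8` weights is just a weighted draw over tasks when assembling training samples; a trivial sketch (the ratio and task names are from the README, the rest is illustrative):

```python
import random

def sample_task() -> str:
    # Pick the training task at a 3:1 tts:stt ratio, per the README.
    return random.choices(["tts", "stt"], weights=[3, 1], k=1)[0]
```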