ecker committed on
Commit 36c9b57 · verified · 1 Parent(s): ce4bc83

Update README.md

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -57,6 +57,10 @@ This repo contains the following configurations under `./models/`:
   + Classifier-free-guidance-aware training was also performed, really helping prompt adherence even at ar-temperature=1.0.
   + Regression tests are needed just in case I did botch something, but it seems really nice so far.
   + The old weights are saved as `ar+nar-old-llama-8` in the event of a nasty regression, but I doubt it's necessary.
+ * Addendum: this received additional NAR-len training for an inferencing mode with huge speedups.
+ * Despite the model *technically* receiving some (wrong) training for this modality, it works well enough from an existing model, albeit not with quality on par with the base AR+NAR modality.
+ * Weights will update as training progresses for NAR-len, and it may pivot to become the default modality.
+ * If all goes well, these weights will revert to the original snapshot, while the reference model will be renamed to `ar+nar-len-llama-8` instead.
 
  * ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights use the above weights but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
@@ -90,19 +94,15 @@ This repo contains the following configurations under `./models/`:
   * Despite being a failure, this does pave a nice way to shrink models from an existing model. However, it does not seem to be useful, as even dropping two or three layers really harms how well the prompt is followed.
 
  * `config.llama[nar-len].yaml` / `nar-len-llama-8`: A fully non-autoregressive model.
-  * These weights are a work in progress, but currently a good proof of concept until training is on par with the base `ar+nar-llama-8` model.
-  * A ***lot*** of pain was put into trying to get something working, from implementation issues to dumb mistakes, until the best option of just training from scratch was picked.
-  * Technically, `ar+nar-llama-8` can be modified into a pure non-autoregressive model, but I needed to start from scratch before dumping more time into trying to adapt it.
-  * Speedups are immense compared to `ar+nar-llama-8`, as the entire audio output is decoded in parallel rather than causally.
+  * These weights are mostly an experiment to ensure that a pure NAR model works (through demasking inferencing).
+  * A ***lot*** of pain was put into trying to get something working, from implementation issues to dumb mistakes.
   * Throughput and memory usage should be constant between inferencing steps.
-  * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead.
-  * Seems to absolutely require classifier-free guidance to keep the output stable.
-  * The "confidence" issue on voices it hasn't seen / hasn't seen much of is much more noticeable, as RVQ level 0 is much more susceptible to it.
+  * The model only needs to be invoked about 5+(25+7)*2 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels, doubled for CFG) instead.
+  * Seems to absolutely require classifier-free guidance >= 2.0 to keep the output stable (but this does replace the need for rep pen + low temp, even for the normal AR+NAR).
   * Unlike the base model, this is trained with the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
-  * ...except STT; this received no STT training out of fear of botching the model.
-  * Weights will be added as the model is trained.
-  * This *was* expected to be a dud, but one very, very small oversight in the sampling code proved to be the culprit...
-  * In other words, the model *does* work.
+  * This *was* slated as a dud until the final oversight was squashed in the inferencing code, but it works *almost decently* as a TTS model.
+  * The output quality itself leaves a lot to be desired.
+  * Training on this model is finalized, as dedicating training time to extending the base model for NAR-len capabilities is the better option, but these weights will remain for whom it may concern.
 
 Some additional configurations have been explored, but experiments have not been fruitful:
  * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
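As a rough illustration of the NAR-len figures quoted above (about 5 + (25 + 7) * 2 invocations: duration inferencing, RVQ level 0 demasking, and the remaining RVQ levels, doubled for classifier-free guidance, with CFG >= 2.0 needed for stable output), here is a minimal sketch of a demasking loop with CFG blending. This is not the repo's actual inferencing code: the `model(tokens, conditioning)` signature, the mask token, and the linear keep schedule are assumptions for illustration only.

```python
import torch

# Hypothetical constants mirroring the step counts quoted above.
DURATION_STEPS   = 5    # forward passes to estimate the output duration
DEMASK_STEPS     = 25   # iterative demasking passes for RVQ level 0
REMAINING_LEVELS = 7    # one pass per remaining RVQ level
CFG_SCALE        = 2.0  # the README notes >= 2.0 is needed for stable output
MASK_ID          = -1   # placeholder "masked" token id (assumption)

def cfg_logits(model, tokens, cond, null_cond, scale=CFG_SCALE):
    """Blend conditional and unconditional logits (two forward passes per step)."""
    logits_cond   = model(tokens, cond)        # assumed signature: model(tokens, conditioning)
    logits_uncond = model(tokens, null_cond)
    return logits_uncond + scale * (logits_cond - logits_uncond)

def demask_rvq_level_0(model, cond, null_cond, length):
    """Start fully masked, then keep a growing set of the most confident
    predictions each step; everything else stays masked for the next pass."""
    tokens = torch.full((1, length), MASK_ID, dtype=torch.long)
    for step in range(DEMASK_STEPS):
        logits = cfg_logits(model, tokens, cond, null_cond)   # (1, length, vocab)
        conf, ids = logits.softmax(dim=-1).max(dim=-1)        # per-position confidence
        keep = max(1, (length * (step + 1)) // DEMASK_STEPS)  # simple linear schedule
        threshold = conf.topk(keep, dim=-1).values[..., -1:]
        tokens = torch.where(conf >= threshold, ids, torch.full_like(ids, MASK_ID))
    return tokens

# The remaining RVQ levels (1..7) would each take one more CFG-blended pass.
# Rough forward-pass budget, matching the "5 + (25 + 7) * 2" figure above;
# it stays constant regardless of the audio length, unlike AR decoding.
total_passes = DURATION_STEPS + (DEMASK_STEPS + REMAINING_LEVELS) * 2
print(total_passes)  # 69
```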