Update README.md
README.md
CHANGED
@@ -23,29 +23,36 @@ This repo contains the following configurations under `./models/`:
   + Prior testing showed that longer prompt durations result in better utterances.
   + *Can* benefit from additional training, but I recall the average loss being around `1.9` to `2.1`.
   + However, due to regressions (or bias from working under `llama`), I don't think I can optimally train with a RetNet again (both in terms of VRAM consumption and throughput).
+  + Currently does not seem to work anymore due to regressions in the code.
 
 * `config.llama.yaml` / `ar+nar-llama-8`: The most recent-ishly trained weights after learning from my mistakes.
   + This configuration utilizes Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, which shouldn't really affect things).
-  + Prompt and response embeddings
+  + Prompt and response embeddings ARE summed (half the model was trained without summing, but enabling it seemed to make the most sense, and it didn't affect anything to do so).
   + Utilizes an HF tokenizer for "optimal" vocab.
   + The current RVQ level is included as a token as well to help guide NAR tasks better.
   + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and get the model to better inference longer utterances.
   + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
+  + This model *actually* received additional post-training for a variety of issues that needed to be addressed:
+    + Training on shuffled batches of durations to have it better generalize across a variety of durations.
+    + Non-naive prompt sampling for similar utterances to try and give better prompt adherence.
+    + Additional languages (Japanese, French, and German) and an additional task: Speech-to-Text (phonemes).
+    + etc.
   + ~~However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.~~
     - ~~I believe the "slowly stepping up the context length" only works for text, and not audio.~~
     - Addendum: Additional brief training for a variety of duration lengths seemed to have mostly fixed this issue.
     - Addendum addendum: Properly creating the position IDs per-segment, rather than for the whole sequence, also helps a lot (see the sketch below).
   + Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
     - Addendum: Additional brief training for sampling based on speaker per "epoch" (per dataloader, not dataset) seemed to slightly improve it.
-
+    - Addendum addendum: non-naive prompt sampling with a similar utterance to the output helps a non-negligible amount.
+  + Testing showed that~~, despite also stepping up the prompt duration, it *really* likes three second prompts.~~ longer input prompts do actually help.
+  + Giving a wide coverage of phonemes to directly reference goes a long way.
   + Definitely needs additional training, but where to go next is unknown.
   + Naturally, training it on a "next RVQ level is half as likely" distribution introduces some crust, as the later RVQ levels are less accurate, introducing noise and artifacts (see the sketch below).
-
-  + Additional training on the AR will see huge diminishing returns, so I don't know if it's worth doing so.
+  + Additional training on the AR will ~~see huge diminishing returns, so I don't know if it's worth doing so.~~ see slight improvements over additional epochs with different training/sampling paradigms.
   + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
     - Addendum: it seems to serve fine for patch-training a few extra tweaks, such as non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.
 
-*
+* ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights use the above weights, but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
   + Initially was trained with `duration_range: [3.0, 60.0]` and `sample_shuffle: True` for a few hours, but then pivoted to my standard `duration_range: [3.0, 12.0]` and `sample_shuffle: False`.
   + Will need the former training to "undo" any issues with durations, as that issue usually came up before.
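Two minimal Python sketches of ideas referenced in the hunk above; they are illustrative only and use hypothetical names, not this repo's actual code. The first shows what "position IDs per-segment" means (each segment restarts its position count at 0 instead of sharing one global 0..N-1 numbering); the second shows a "next RVQ level is half as likely" sampling distribution.

```python
# Illustrative sketch (not the repo's code): per-segment position IDs restart
# at 0 for each segment (e.g. text, prompt audio, response audio), rather than
# numbering the whole concatenated sequence 0..N-1.
from typing import List

def per_segment_position_ids(segment_lengths: List[int]) -> List[int]:
    """Position IDs that reset to 0 at the start of every segment."""
    ids: List[int] = []
    for length in segment_lengths:
        ids.extend(range(length))
    return ids

def whole_sequence_position_ids(segment_lengths: List[int]) -> List[int]:
    """The naive alternative: one running count across all segments."""
    return list(range(sum(segment_lengths)))

# e.g. a 4-token text segment, a 3-token prompt, and a 5-token response:
print(per_segment_position_ids([4, 3, 5]))     # [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4]
print(whole_sequence_position_ids([4, 3, 5]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

```python
# Illustrative sketch (not the repo's code): pick which RVQ level to train on
# so that "the next RVQ level is half as likely", i.e. P(level k) ∝ 0.5 ** k.
import random

def sample_rvq_level(num_levels: int = 8, decay: float = 0.5) -> int:
    """Draw an RVQ level; each successive level is half as likely as the previous."""
    weights = [decay ** level for level in range(num_levels)]
    return random.choices(range(num_levels), weights=weights, k=1)[0]

counts = [0] * 8
for _ in range(10_000):
    counts[sample_rvq_level()] += 1
print(counts)  # counts roughly halve from one level to the next
```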
@@ -55,6 +62,14 @@
   + I first need a good mechanism to make sure I *can* extend existing weights with additional tasks, but with a simple enough task.
   + This also *maybe* seems to help bolster the initial TTS task by helping the model have a better internal state (or something to that tune).
   + STT is not perfect against voices that aren't close to a normal speaking voice (as per the dataset), unlike TTS, where you can easily have "sounds close enough" and room for error.
+  + Addendum: this replaced `ar+nar-llama-8` as the de facto model (taking its name), so the above does apply.
+
+* `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
+  + Trained on a small English subset of Emilia and a small private corpus, plus Japanese+French+German from Emilia.
+  + Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR (sketched at the end).
+  + This model received LayerSkip-aware training, with layer dropout and early-exit loss to help try and bolster the model and enable self-speculation sampling (see the sketch below).
+  + I *need* to do heavy evaluation against the base model to ensure output quality does not drop before considering replacing the base model with this.
+  + The goal is to utilize self-speculation sampling to enable speedups when possible.
 
 Some additional configurations have been explored, but the experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
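A rough PyTorch-style sketch of the LayerSkip-aware training mentioned above, assuming a per-layer list of hidden states and a shared LM head; all names here are hypothetical, not this repo's API. Intermediate layers contribute early-exit losses so they can predict on their own, and layer dropout randomly skips deeper layers during training. Self-speculation sampling at inference would then draft tokens from an early exit and verify them with the remaining layers; only the training side is sketched here.

```python
# Hypothetical LayerSkip-style training helpers: early-exit loss + layer dropout.
import torch
import torch.nn.functional as F

def early_exit_loss(hidden_states, lm_head, targets):
    """hidden_states: list of [batch, seq, dim] tensors, one per transformer layer."""
    num_layers = len(hidden_states)
    total, weight_sum = 0.0, 0.0
    for layer_idx, hidden in enumerate(hidden_states):
        logits = lm_head(hidden)  # the classifier head is shared across all exits
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=-100)
        weight = (layer_idx + 1) / num_layers  # later exits are weighted more heavily
        total = total + weight * loss
        weight_sum += weight
    return total / weight_sum

def drop_layers(layers, p_max: float = 0.1):
    """Layer dropout: randomly skip layers, with deeper layers dropped more often."""
    kept = []
    for layer_idx, layer in enumerate(layers):
        p_drop = p_max * layer_idx / max(len(layers) - 1, 1)
        if torch.rand(()).item() >= p_drop:
            kept.append(layer)
    return kept
```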
@@ -89,21 +104,5 @@ This repo also contains some LoRAs to serve as a reference under `./loras/`.
 
 Using a LoRA is the same as using a base model, except you're required to have the base model already (obviously). Just use the LoRA's config YAML to load from instead.
 
-The only caveat is that my original dataset *does* contain these samples already, but given the sheer size of it, they're probably underutilized.
-* However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.
-
-* `config.lora.glados.yaml` / `lora-glados-r128-a128`:
-  + A simple LoRA of GLaDOS from both Portal and Portal 2.
-  + Trained for 250 steps (48000 samples, 821 samples per epoch).
-* `config.lora.sam.yaml` / `lora-sam-r128-a128`:
-  + A simple LoRA of Sam from the non-remaster Sam and Max Telltale games.
-  + Trained for 250 steps (48000 samples, 1555 samples per epoch).
-* `config.lora.max.yaml` / `lora-max-r128-a128`:
-  + A simple LoRA of Max from the non-remaster Sam and Max Telltale games.
-  + Trained for 250 steps (48000 samples, 1292 samples per epoch).
-* `config.lora.shodan.yaml` / `lora-shodan-r128-a128`:
-  + A simple LoRA of SHODAN from System Shock 2.
-  + This is honestly probably the hardest voice the model can attend to due to:
-    + the nature of her voice
-    + the low amount of samples
-    + the fine line between undertraining and overfitting
+The only caveat is that my original dataset *does* contain (most of) these samples already, but given the sheer size of it, they're probably underutilized.
+* However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.
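As a quick sanity check on the (now removed) reference LoRA entries above, the "250 steps (48000 samples, N samples per epoch)" figures work out as below; the helper is purely illustrative arithmetic on the numbers quoted in the deleted lines.

```python
# Back-of-the-envelope arithmetic on the quoted LoRA training figures.
def lora_training_stats(steps: int, samples_seen: int, samples_per_epoch: int) -> dict:
    return {
        "effective_batch_size": samples_seen // steps,            # 48000 / 250 = 192
        "epochs_over_speaker_data": samples_seen / samples_per_epoch,
    }

for name, per_epoch in {"glados": 821, "sam": 1555, "max": 1292}.items():
    print(name, lora_training_stats(steps=250, samples_seen=48_000, samples_per_epoch=per_epoch))
# glados: ~58 epochs, sam: ~31 epochs, max: ~37 epochs over each speaker's data
```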
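Referring back to the `duration_range` setting and the shuffled duration-matched batches mentioned in the hunks above, here is a minimal sketch (hypothetical names, not this repo's dataloader) of batching where every batch holds similarly long utterances and anything outside the duration window is dropped.

```python
# Illustrative sketch: duration-bucketed, shuffled batches.
import random
from collections import defaultdict

def duration_bucketed_batches(durations, batch_size=8, duration_range=(3.0, 12.0), seed=0):
    """durations: mapping of utterance id -> duration in seconds."""
    lo, hi = duration_range
    buckets = defaultdict(list)
    for utt_id, dur in durations.items():
        if lo <= dur <= hi:              # drop utterances outside the duration window
            buckets[round(dur)].append(utt_id)
    rng = random.Random(seed)
    batches = []
    for bucket in buckets.values():
        rng.shuffle(bucket)              # shuffle within a duration bucket
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)                 # shuffle the order of batches, not their contents
    return batches
```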