Update README.md
README.md
CHANGED
@@ -23,29 +23,36 @@ This repo contains the following configurations under `./models/`:
   + Prior testing showed that longer prompt durations result in better utterances.
   + *Can* benefit from additional training, but I recall the average loss being around `1.9` to `2.1`.
   + However, due to regressions (or bias from working under `llama`), I don't think I can optimally train with a RetNet again (both in terms of VRAM consumption and throughput).
+  + Currently does not seem to work anymore due to regressions in the code.
 
 * `config.llama.yaml` / `ar+nar-llama-8`: The most recent-ishly trained weights after learning from my mistakes.
   + This configuration utilizes Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, which shouldn't really affect things).
-  + Prompt and response embeddings
+  + Prompt and response embeddings ARE summed (half the model was trained without summing, but enabling it seemed to make the most sense, and it didn't affect anything to do so).
   + Utilizes an HF tokenizer for "optimal" vocab.
   + The current RVQ level is included as a token as well to help guide NAR tasks better.
   + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and get the model to better inference longer utterances.
   + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
+  + This model *actually* received additional post-training for a variety of issues that needed to be addressed:
+    + Training on shuffled batches of durations to have it better generalize across a variety of durations.
+    + Non-naive prompt sampling for similar utterances to try and give better prompt adherence.
+    + Additional languages (Japanese, French, and German) and an additional task: Speech-to-Text (phonemes).
+    + etc.
   + ~~However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.~~
     - ~~I believe the "slowly stepping up the context length" only works for text, and not audio.~~
     - Addendum: Additional brief training for a variety of duration lengths seemed to have mostly fixed this issue.
     - Addendum addendum: Properly creating the position IDs per-segment, rather than for the whole sequence, also helps a lot (see the sketch below).
   + Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
     - Addendum: Additional brief training for sampling based on speaker per "epoch" (per dataloader, not dataset) seemed to slightly improve it.
-
+    - Addendum addendum: non-naive prompt sampling with a similar utterance to the output helps a non-negligible amount.
+  + Testing showed that~~, despite also stepping up the prompt duration, it *really* likes three second prompts.~~ longer input prompts do actually help.
+  + Giving a wide coverage of phonemes to directly reference goes a long way.
   + Definitely needs additional training, but where to go next is unknown.
   + Naturally, training it on a "next RVQ level is half as likely" distribution introduces some crust, as the later RVQ levels are less accurate, introducing noise and artifacts (see the sketch below).
-
-  + Additional training on the AR will see huge diminishing returns, so I don't know if it's worth doing so.
+  + Additional training on the AR will ~~see huge diminishing returns, so I don't know if it's worth doing so.~~ see slight improvements over additional epochs with different training/sampling paradigms.
   + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
     - Addendum: it seems to serve fine for patch-training a few extra tweaks, such as non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.
 
-*
+* ~~`config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`~~: The above, but partially trained for STT.
   + These weights use the above weights, but with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
   + Initially was trained with `duration_range: [3.0, 60.0]` and `sample_shuffle: True` for a few hours, but then pivoted to my standard `duration_range: [3.0, 12.0]` and `sample_shuffle: False`.
   + Will need the former training to "undo" any issues with durations, as that issue usually came up before.
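Two minimal Python sketches of ideas referenced in the hunk above; they are illustrative only and use hypothetical names, not this repo's actual code. The first shows what "position IDs per-segment" means (each segment restarts its position count at 0 instead of sharing one global 0..N-1 numbering); the second shows a "next RVQ level is half as likely" sampling distribution.

```python
# Illustrative sketch (not the repo's code): per-segment position IDs restart
# at 0 for each segment (e.g. text, prompt audio, response audio), rather than
# numbering the whole concatenated sequence 0..N-1.
from typing import List

def per_segment_position_ids(segment_lengths: List[int]) -> List[int]:
    """Position IDs that reset to 0 at the start of every segment."""
    ids: List[int] = []
    for length in segment_lengths:
        ids.extend(range(length))
    return ids

def whole_sequence_position_ids(segment_lengths: List[int]) -> List[int]:
    """The naive alternative: one running count across all segments."""
    return list(range(sum(segment_lengths)))

# e.g. a 4-token text segment, a 3-token prompt, and a 5-token response:
print(per_segment_position_ids([4, 3, 5]))     # [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4]
print(whole_sequence_position_ids([4, 3, 5]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

```python
# Illustrative sketch (not the repo's code): pick which RVQ level to train on
# so that "the next RVQ level is half as likely", i.e. P(level k) ∝ 0.5 ** k.
import random

def sample_rvq_level(num_levels: int = 8, decay: float = 0.5) -> int:
    """Draw an RVQ level; each successive level is half as likely as the previous."""
    weights = [decay ** level for level in range(num_levels)]
    return random.choices(range(num_levels), weights=weights, k=1)[0]

counts = [0] * 8
for _ in range(10_000):
    counts[sample_rvq_level()] += 1
print(counts)  # counts roughly halve from one level to the next
```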
@@ -55,6 +62,14 @@
   + I first need a good mechanism to make sure I *can* extend existing weights with additional tasks, but with a simple enough task.
   + This also *maybe* seems to help bolster the initial TTS task by helping the model have a better internal state (or something to that tune).
   + STT is not perfect against voices that aren't close to a normal speaking voice (as per the dataset), unlike TTS, where you can easily have "sounds close enough" and room for error.
+  + Addendum: this replaced `ar+nar-llama-8` as the de facto model (taking its name), so the above does apply.
+
+* `config.llama[layerskip].yaml` / `ar+nar-layerskip-llama-8`: The above, but with very brief training for LayerSkip:
+  + Trained on a small English subset of Emilia and a small private corpus, plus Japanese+French+German from Emilia.
+  + Using shuffled batches (where each batch has the same durations) and a modified `rvq_levels_p` to help the NAR (sketched at the end).
+  + This model received LayerSkip-aware training, with layer dropout and early-exit loss to help try and bolster the model and enable self-speculation sampling (see the sketch below).
+  + I *need* to do heavy evaluation against the base model to ensure output quality does not drop before considering replacing the base model with this.
+  + The goal is to utilize self-speculation sampling to enable speedups when possible.
 
 Some additional configurations have been explored, but the experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
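A rough PyTorch-style sketch of the LayerSkip-aware training mentioned above, assuming a per-layer list of hidden states and a shared LM head; all names here are hypothetical, not this repo's API. Intermediate layers contribute early-exit losses so they can predict on their own, and layer dropout randomly skips deeper layers during training. Self-speculation sampling at inference would then draft tokens from an early exit and verify them with the remaining layers; only the training side is sketched here.

```python
# Hypothetical LayerSkip-style training helpers: early-exit loss + layer dropout.
import torch
import torch.nn.functional as F

def early_exit_loss(hidden_states, lm_head, targets):
    """hidden_states: list of [batch, seq, dim] tensors, one per transformer layer."""
    num_layers = len(hidden_states)
    total, weight_sum = 0.0, 0.0
    for layer_idx, hidden in enumerate(hidden_states):
        logits = lm_head(hidden)  # the classifier head is shared across all exits
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=-100)
        weight = (layer_idx + 1) / num_layers  # later exits are weighted more heavily
        total = total + weight * loss
        weight_sum += weight
    return total / weight_sum

def drop_layers(layers, p_max: float = 0.1):
    """Layer dropout: randomly skip layers, with deeper layers dropped more often."""
    kept = []
    for layer_idx, layer in enumerate(layers):
        p_drop = p_max * layer_idx / max(len(layers) - 1, 1)
        if torch.rand(()).item() >= p_drop:
            kept.append(layer)
    return kept
```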
@@ -89,21 +104,5 @@ This repo also contains some LoRAs to serve as a reference under `./loras/`.
 
 Using a LoRA is the same as using a base model, except you're required to have the base model already (obviously). Just use the LoRA's config YAML to load from instead.
 
-The only caveat is that my original dataset *does* contain these samples already, but given the sheer size of it, they're probably underutilized.
-* However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.
-
-* `config.lora.glados.yaml` / `lora-glados-r128-a128`:
-  + A simple LoRA of GLaDOS from both Portal and Portal 2.
-  + Trained for 250 steps (48000 samples, 821 samples per epoch).
-* `config.lora.sam.yaml` / `lora-sam-r128-a128`:
-  + A simple LoRA of Sam from the non-remaster Sam and Max Telltale games.
-  + Trained for 250 steps (48000 samples, 1555 samples per epoch).
-* `config.lora.max.yaml` / `lora-max-r128-a128`:
-  + A simple LoRA of Max from the non-remaster Sam and Max Telltale games.
-  + Trained for 250 steps (48000 samples, 1292 samples per epoch).
-* `config.lora.shodan.yaml` / `lora-shodan-r128-a128`:
-  + A simple LoRA of SHODAN from System Shock 2.
-  + This is honestly probably the hardest voice the model can attend to due to:
-    + the nature of her voice
-    + the low amount of samples
-    + the fine line between undertraining and overfitting
+The only caveat is that my original dataset *does* contain (most of) these samples already, but given the sheer size of it, they're probably underutilized.
+* However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.
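As a quick sanity check on the (now removed) reference LoRA entries above, the "250 steps (48000 samples, N samples per epoch)" figures work out as below; the helper is purely illustrative arithmetic on the numbers quoted in the deleted lines.

```python
# Back-of-the-envelope arithmetic on the quoted LoRA training figures.
def lora_training_stats(steps: int, samples_seen: int, samples_per_epoch: int) -> dict:
    return {
        "effective_batch_size": samples_seen // steps,            # 48000 / 250 = 192
        "epochs_over_speaker_data": samples_seen / samples_per_epoch,
    }

for name, per_epoch in {"glados": 821, "sam": 1555, "max": 1292}.items():
    print(name, lora_training_stats(steps=250, samples_seen=48_000, samples_per_epoch=per_epoch))
# glados: ~58 epochs, sam: ~31 epochs, max: ~37 epochs over each speaker's data
```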
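Referring back to the `duration_range` setting and the shuffled duration-matched batches mentioned in the hunks above, here is a minimal sketch (hypothetical names, not this repo's dataloader) of batching where every batch holds similarly long utterances and anything outside the duration window is dropped.

```python
# Illustrative sketch: duration-bucketed, shuffled batches.
import random
from collections import defaultdict

def duration_bucketed_batches(durations, batch_size=8, duration_range=(3.0, 12.0), seed=0):
    """durations: mapping of utterance id -> duration in seconds."""
    lo, hi = duration_range
    buckets = defaultdict(list)
    for utt_id, dur in durations.items():
        if lo <= dur <= hi:              # drop utterances outside the duration window
            buckets[round(dur)].append(utt_id)
    rng = random.Random(seed)
    batches = []
    for bucket in buckets.values():
        rng.shuffle(bucket)              # shuffle within a duration bucket
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)                 # shuffle the order of batches, not their contents
    return batches
```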