Update README.md

This repo contains the following configurations under `./models/`:
  + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
    - Addendum: it seems to serve fine for patch-training a few extra tweaks: non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.

* `config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`: The above, but partially trained for STT.
  + These weights build on the above, with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
  + Initially trained with `duration_range: [3.0, 60.0]` and `sample_shuffle: True` for a few hours, but then pivoted back to my standard `duration_range: [3.0, 12.0]` and `sample_shuffle: False` (see the sketch after this list).
    - The former training will be needed to "undo" any issues with durations, as such issues have usually come up before.
  + The `stt` task simply takes a piece of audio and outputs a transcription as IPA phonemes (which the model is already trained against for its text inputs).
    - This can be done with `--task=stt` and an empty (`""`) text input through the CLI interface, or through the `Speech-to-Text` tab in the web UI (see the example after this list).
  + This mainly serves as a stepping stone before pivoting towards SpeechX tasks.
    - I first need a good mechanism to verify that I *can* extend existing weights with additional tasks, starting with a simple enough task.
  + This also *maybe* seems to bolster the initial TTS task by giving the model a better internal state (or something to that tune).
  + STT is not perfect against voices that aren't close to a normal speaking voice (as per the dataset), unlike TTS, where "sounds close enough" leaves room for error.
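
A minimal sketch of the dataset settings mentioned above. Only `duration_range` and `sample_shuffle` are quoted from the notes; the enclosing `dataset` block is an assumption about how the YAML is laid out, not a copy of the actual config:

```yaml
# Hypothetical excerpt of config.llama-tts+stt.yaml.
dataset:
  # initially [3.0, 60.0] for a few hours, then pivoted back to the standard range
  duration_range: [3.0, 12.0]
  # initially True, then reverted
  sample_shuffle: False
```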
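
An illustrative CLI invocation for the `stt` task. Only `--task=stt` and the empty (`""`) text input are from the notes above; the `python -m vall_e` entry point, the argument order, and the `--yaml` flag are assumptions:

```bash
# Transcribe ./reference.wav into IPA phonemes (hypothetical invocation).
python -m vall_e "" ./reference.wav --task=stt --yaml=./models/config.llama-tts+stt.yaml
```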

Some additional configurations have been explored, but the experiments have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
* Mamba / Mamba2-based models have shown that it's ***really*** hard to get an AR+NAR model working. I really do not want to throw more compute at another ~~meme~~ arch where I can't easily make use of all the other tech available.