Update README.md

This repo contains the following configurations under `./models/`:
  + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
    - Addendum: it seems to serve fine for patch-training a few extra tweaks: non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.

* `config.llama-tts+stt.yaml` / `ar+nar-tts+stt-llama-8`: The above, but partially trained for STT.
  + These weights build on the above, with additional training for the default `tts` task and a new `stt` task (at a 3:1 ratio).
  + Initially trained with `duration_range: [3.0, 60.0]` and `sample_shuffle: True` for a few hours, but then pivoted back to my standard `duration_range: [3.0, 12.0]` and `sample_shuffle: False` (see the sketch after this list).
    - The former training will be needed to "undo" any issues with durations, as such issues have usually come up before.
  + The `stt` task simply takes a piece of audio and outputs a transcription as IPA phonemes (which the model is already trained against for its text inputs).
    - This can be done with `--task=stt` and an empty (`""`) text input through the CLI interface, or through the `Speech-to-Text` tab in the web UI (see the example after this list).
  + This mainly serves as a stepping stone before pivoting towards SpeechX tasks.
    - I first need a good mechanism to verify that I *can* extend existing weights with additional tasks, starting with a simple enough task.
  + This also *maybe* seems to bolster the initial TTS task by giving the model a better internal state (or something to that tune).
  + STT is not perfect against voices that aren't close to a normal speaking voice (as per the dataset), unlike TTS, where "sounds close enough" leaves room for error.
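
A minimal sketch of the dataset settings mentioned above. Only `duration_range` and `sample_shuffle` are quoted from the notes; the enclosing `dataset` block is an assumption about how the YAML is laid out, not a copy of the actual config:

```yaml
# Hypothetical excerpt of config.llama-tts+stt.yaml.
dataset:
  # initially [3.0, 60.0] for a few hours, then pivoted back to the standard range
  duration_range: [3.0, 12.0]
  # initially True, then reverted
  sample_shuffle: False
```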
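
An illustrative CLI invocation for the `stt` task. Only `--task=stt` and the empty (`""`) text input are from the notes above; the `python -m vall_e` entry point, the argument order, and the `--yaml` flag are assumptions:

```bash
# Transcribe ./reference.wav into IPA phonemes (hypothetical invocation).
python -m vall_e "" ./reference.wav --task=stt --yaml=./models/config.llama-tts+stt.yaml
```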

Some additional configurations have been explored, but the experiments have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
* Mamba / Mamba2-based models have shown that it's ***really*** hard to get an AR+NAR model working. I really do not want to throw more compute at another ~~meme~~ arch where I can't easily make use of all the other tech available.