mrq committed on
Commit
e36dabd
·
1 Parent(s): fec1f3b
README.md CHANGED
@@ -10,7 +10,7 @@ To reiterate, this is ***by no means*** complete. I am not passing this off as c
10
 
11
  ## Models
12
 
13
- This repo contains the following configurations:
14
 
15
  * `config.retnet.yaml` / `ar+nar-retnet-8`: The previously released weights.
16
  + This configuration utilizes a RetNet (retention based "transformer") as the underlying architecture due to a number of misleading interpretations with comparisons, for better or for worse.
@@ -34,20 +34,26 @@ This repo contains the following configurations:
34
  + ~~However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.~~
35
  - ~~I believe the "slowly stepping up the context length" only works for text, and not audio.~~
36
  - Addendum: Additional brief training for a variety of duration lengths seemed to have mostly fixed this issue.
 
37
  + Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
38
  - Addendum: Additional brief training for sampling based on speaker per "epoch" (per dataloader, not dataset) seemed to slightly improve it.
39
  + Testing showed that, despite also stepping up the prompt duration, it *really* likes three second prompts.
40
  + Definitely needs additional training, but the next way to go is unknown.
41
  + Naturally, training it on a "next RVQ level is half as likely" distribution introduces some crust as the later RVQ levels are less accurate, introducing noise and artifacts.
42
- + Naively training it on equally distributed RVQ levels *does* lobotomize the AR.
43
  + Additional training on the AR will see huge diminishing returns, so I don't know if it's worth doing so.
44
  + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
 
 
 
 
 
45
 
46
  * `config.llama.split.yaml` / `ar-llama-1` + `nar-llama-8`: The above model, but split and trained a little bit more.
47
  + This experiment is to see whether the AR and NAR benefitted from being split up after enough pretraining, to un-"lobotomize" any penalties from attending to two different tasks (as the AR predicts the next token, and the NAR predicts the same token but a different level).
48
  + I believe I trained each separate model an additional extra day for another additional audio-duration window for similar training lengths.
49
  + ~~I don't think audio quality differs a non-trivial amount to warrant splitting the model.~~
50
- - From recent experiments, it does seem a NAR-only model is beneficial.
51
 
52
  * `config.dac.yaml` / `ar+nar-dac-llama-9`: Utilizes [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/) instead as the audio backend.
53
  + This utilizes the 44kHz (erroneously at 44,000 Hz instead of 44,100 Hz) model at 9 RVQ levels (majorly trained at 8, then the 9th was included).
@@ -55,20 +61,29 @@ This repo contains the following configurations:
55
  + Later experimented with the 24kHz model, but training would *always* diverge.
56
  + *Heavily* benefits from inferencing only the first four RVQ levels; levels afterwards include far too much noise in the final output.
57
  + I imagine the nature of DAC itself amplifies errors in the remaining RVQ levels (either due to less resiliency to errors in the codes, or each RVQ level affecting the final waveform more).
 
58
  + Has not received as much training as the EnCodec-based models.
59
  + Because of this, performance leaves more to be desired.
60
  + Further experimentation is needed, but the next approach is unknown.
61
  + Train a NAR only model to help bolster the remaining RVQ levels (outputted utterances seem a bit sluggish).
62
  + Continue training the AR+NAR to try and bolster the AR tasks (as it's quite lacking at the moment).
63
  + Delve into other, exotic features, such as utilizing DAC's decoding embeddings (which might not be necessary at all since it seems *fine* at the moment).
 
64
 
65
- Some additional configurations have been explored, but experiments have not been fruitful:
 
 
 
66
 
67
- * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow.
 
 
 
 
 
 
 
68
 
69
- * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model.
70
 
71
- * A NAR only model has been experimented with, but seemed utterly useless in practice.
72
- + The underlying architecture will query the model for the duration, and then inference *all* RVQ levels in parallel (one level at a time).
73
- + Despite working in the overfitting test trainer and decent training metrics, inferencing will have the model fall completely flat.
74
- + I have zero ideas for which path to go with for further experimentation.
 
10
 
11
  ## Models
12
 
13
+ This repo contains the following configurations under `./models/`:
14
 
15
  * `config.retnet.yaml` / `ar+nar-retnet-8`: The previously released weights.
16
  + This configuration utilizes a RetNet (retention based "transformer") as the underlying architecture due to a number of misleading interpretations with comparisons, for better or for worse.
 
34
  + ~~However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.~~
35
  - ~~I believe the "slowly stepping up the context length" only works for text, and not audio.~~
36
  - Addendum: Additional brief training for a variety of duration lengths seemed to have mostly fixed this issue.
37
+ - Addendum addendum: Properly creating the position IDs per segment, rather than for the whole sequence, also helps a lot.
38
  + Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
39
  - Addendum: Additional brief training for sampling based on speaker per "epoch" (per dataloader, not dataset) seemed to slightly improve it.
40
  + Testing showed that, despite also stepping up the prompt duration, it *really* likes three second prompts.
41
  + Definitely needs additional training, but the next way to go is unknown.
42
  + Naturally, training it on a "next RVQ level is half as likely" distribution introduces some crust as the later RVQ levels are less accurate, introducing noise and artifacts.
43
+ + As a fix for the above, naively training it on equally distributed RVQ levels *does* lobotomize the AR.
44
  + Additional training on the AR will see huge diminishing returns, so I don't know if it's worth doing so.
45
  + Seems to be a decent foundation for "distillation", at the very least for LoRA training.
46
+ - Addendum: it seems to serve fine for patch-training a few extra tweaks, such as non-unified position IDs, split classifier heads, and para-parallel decoding for the AR.
47
+
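The "position IDs per segment" addendum above can be illustrated with a minimal sketch. It assumes a sequence made of consecutive segments (for example text, prompt, and response) and that "per-segment" means the position counter restarts at zero for each one; the function names are hypothetical and the repo's actual implementation may differ (for instance in how separator tokens are counted).

```python
def unified_position_ids(segment_lengths):
    """Position IDs counted once across the whole concatenated sequence."""
    return list(range(sum(segment_lengths)))


def per_segment_position_ids(segment_lengths):
    """Position IDs that restart at 0 at the start of every segment (assumed meaning)."""
    ids = []
    for length in segment_lengths:
        ids.extend(range(length))
    return ids


# Hypothetical segment lengths: 3 text tokens, 2 prompt tokens, 4 response tokens.
lengths = [3, 2, 4]
unified = unified_position_ids(lengths)
segmented = per_segment_position_ids(lengths)
```

With per-segment IDs, the first response token always sits at position 0 regardless of how long the text or prompt is, which is one plausible reason it would help with varying durations.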
48
+ ## Experiments
49
+
50
+ Under `./models/experiments/` are some failed models, included to serve as references for my errors. Do ***not*** use them unless you're curious, or know what you're doing.
51
 
52
  * `config.llama.split.yaml` / `ar-llama-1` + `nar-llama-8`: The above model, but split and trained a little bit more.
53
  + This experiment is to see whether the AR and NAR benefitted from being split up after enough pretraining, to un-"lobotomize" any penalties from attending to two different tasks (as the AR predicts the next token, and the NAR predicts the same token but a different level).
54
  + I believe I trained each separate model an additional extra day for another additional audio-duration window for similar training lengths.
55
  + ~~I don't think audio quality differs a non-trivial amount to warrant splitting the model.~~
56
+ - Addendum: From recent experiments, it does seem a NAR-only model is beneficial; I will need to explore this in the future.
57
 
58
  * `config.dac.yaml` / `ar+nar-dac-llama-9`: Utilizes [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/) instead as the audio backend.
59
  + This utilizes the 44kHz (erroneously at 44,000 Hz instead of 44,100 Hz) model at 9 RVQ levels (majorly trained at 8, then the 9th was included).
 
61
  + Later experimented with the 24kHz model, but training would *always* diverge.
62
  + *Heavily* benefits from inferencing only the first four RVQ levels; levels afterwards include far too much noise in the final output.
63
  + I imagine the nature of DAC itself amplifies errors in the remaining RVQ levels (either due to less resiliency to errors in the codes, or each RVQ level affecting the final waveform more).
64
+ + Addendum: restricting to the first four RVQ levels seems to help remove noisy artifacts, but quality is hindered as there are still fewer RVQ levels to rely on.
65
  + Has not received as much training as the EnCodec-based models.
66
  + Because of this, performance leaves more to be desired.
67
  + Further experimentation is needed, but the next approach is unknown.
68
  + Train a NAR only model to help bolster the remaining RVQ levels (outputted utterances seem a bit sluggish).
69
  + Continue training the AR+NAR to try and bolster the AR tasks (as it's quite lacking at the moment).
70
  + Delve into other, exotic features, such as utilizing DAC's decoding embeddings (which might not be necessary at all since it seems *fine* at the moment).
71
+ + Addendum: This seems unnecessary, as freezing these embeddings is harmful, and not freezing them will just inevitably cause them to shift elsewhere.
72
 
73
+ * `config.dac-nar-len.yaml` / `nar-len-llama-9`: A DAC-based model, but a pure NAR model (+ autoregressive length task).
74
+ + Originally thought to be bunk from inferencing tests having audio drastically drop off into silence, but I suppose it was just some issue that eventually resolved itself.
75
+ + Suffers from the same problems the above model suffers from (terrible quality).
76
+ + *Huge* performance gains, but may definitely suffer from some specific qualities in the outputs, if it does get trained right.
77
 
78
+ * `config.llama-x4.yaml` / `ar+nar-llama-8`: The above `ar+nar-llama-8` model, but with para-parallel decoding for the AR in-post.
79
+ + This mostly serves as a proof-of-concept for speeding up inferencing by reducing the number of steps required, by decoding multiple tokens in parallel with a similar approach to how the NAR decodes in parallel.
80
+ + Trained with the trainer's batch-by-durations sampler for a maximum duration batch size of 100 seconds (750 resp tokens), with ProdigyOpt at bfloat16 (no AMP) on my 4070Ti (because I can't be assed to fire up my 4xV100 machine again for a simple test).
81
+ + The model definitely needs to be retrained as there are some errors for the additional tokens.
82
+ + If these cannot be ironed out with more training, then I imagine a similar approach to speculative decoding could work, where the nth tokens are discarded if the confidence is low.
83
+ + Greedy sampling might be beneficial instead for this, as the NAR does benefit greatly from low temperatures / greedy sampling.
84
+
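The "discard the nth tokens if the confidence is low" idea above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `threshold` value, and the per-token confidence scores are all hypothetical, and it is not real speculative decoding (there is no draft/verify model pair), just the acceptance rule. It always keeps the first token of a step so decoding still makes progress, then keeps further tokens until one falls below the threshold.

```python
def accept_confident_prefix(tokens, confidences, threshold=0.5):
    """Keep the first token, then following tokens until one is below `threshold`."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not tokens:
        return []

    accepted = [tokens[0]]  # the first token is always kept
    for token, confidence in zip(tokens[1:], confidences[1:]):
        if confidence < threshold:
            break  # discard this token and everything decoded after it
        accepted.append(token)
    return accepted


# Hypothetical step that decoded four tokens in parallel.
step_tokens = [5, 9, 3, 7]
step_confidences = [0.9, 0.8, 0.2, 0.95]
kept = accept_confident_prefix(step_tokens, step_confidences, threshold=0.5)
```

Here the third token is rejected, so the fourth is dropped as well even though its own score is high, since it was predicted alongside a token that is being thrown away.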
85
+ Some additional configurations have been explored, but experiments have not been fruitful:
86
 
87
+ * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary as the models are already manageable at ~200M parameters.
88
 
89
+ * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing compute at another ~~meme~~ arch that I can't easily apply all the other tech to.
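The "next RVQ level is half as likely" training distribution mentioned in the README notes can be written out as a small sketch. The helper names are hypothetical and this is not taken from the trainer; it only shows what such a halving distribution looks like when picking which RVQ level to train on.

```python
import random


def rvq_level_weights(num_levels):
    """Unnormalized weights: each level is half as likely as the one before it."""
    return [0.5 ** level for level in range(num_levels)]


def rvq_level_probabilities(num_levels):
    """The same weights, normalized so they sum to 1."""
    weights = rvq_level_weights(num_levels)
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_rvq_level(num_levels, rng=random):
    """Draw one RVQ level index according to the halving distribution."""
    return rng.choices(range(num_levels), weights=rvq_level_weights(num_levels), k=1)[0]


probabilities = rvq_level_probabilities(8)
level = sample_rvq_level(8, rng=random.Random(0))
```

For 8 levels, the first level receives 128/255 (roughly half) of the samples and the last only 1/255, which is consistent with the note that later RVQ levels end up less accurate.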
 
 
 
{model → models}/ckpt/ar+nar-llama-8/fp32.pth RENAMED
File without changes
{model → models}/ckpt/ar+nar-retnet-8/fp32.pth RENAMED
File without changes
{model → models}/config.llama.yaml RENAMED
File without changes
{model → models}/config.retnet.yaml RENAMED
File without changes
{model → models/experiments}/ckpt/ar+nar-dac-llama-9/ckpt/fp32.pth RENAMED
File without changes
{model → models/experiments}/ckpt/ar-llama-1/fp32.pth RENAMED
File without changes
{model → models/experiments}/ckpt/nar-llama-8/fp32.pth RENAMED
File without changes
models/experiments/config.dac-nar-len.yaml ADDED
@@ -0,0 +1,133 @@
1
+ sample_rate: 44_000
2
+ audio_backend: "dac"
3
+
4
+ models:
5
+ - name: "nar-len"
6
+ size:
7
+ audio_tokens: 1024
8
+ text_tokens: 256
9
+ dim: 1024
10
+ heads: 16
11
+ layers: 16
12
+ resp_levels: 9
13
+ prom_levels: 9
14
+ tasks: 8
15
+ langs: 2
16
+ tones: 1
17
+ arch_type: llama
18
+ training: True
19
+ version: 5
20
+ attention: flash_attention_2
21
+ dropout: 0.1
22
+ #loss_factors:
23
+ # text: 0.01
24
+ # prom: 0.5
25
+ # resp: 1.0
26
+ # len: 1.0
27
+ capabilities: ["nar", "len"]
28
+ experimental:
29
+ audio_embedding_sums: False
30
+ interleave: False
31
+ unified_position_ids: True
32
+ rvq_level_range: []
33
+ split_classifiers: True
34
+ tie_classifier_to_embedding: False
35
+
36
+ #loras:
37
+ #- name : "lora-test"
38
+ # rank: 128
39
+ # alpha: 128
40
+ # training: True
41
+ # rvq_levels: []
42
+
43
+ hyperparameters:
44
+ batch_size: 16
45
+ gradient_accumulation_steps: 4
46
+ gradient_clipping: 1.0
47
+ warmup_steps: 10
48
+
49
+ optimizer: Prodigy
50
+ learning_rate: 1.0
51
+ torch_optimizer: True
52
+
53
+ scheduler: "" # ScheduleFree
54
+ torch_scheduler: True
55
+
56
+ evaluation:
57
+ batch_size: 4
58
+ frequency: 250
59
+ size: 4
60
+
61
+ steps: 500
62
+ ar_temperature: 1.0
63
+ nar_temperature: 0.0
64
+
65
+ trainer:
66
+ iterations: 1_000_000
67
+ save_frequency: 250
68
+ keep_last_checkpoints: 4
69
+
70
+ check_for_oom: False
71
+ gradient_checkpointing: False
72
+
73
+ weight_dtype: bfloat16
74
+ amp: False
75
+
76
+ backend: deepspeed
77
+ deepspeed:
78
+ inferencing: False
79
+ amp: False
80
+
81
+ load_webui: False
82
+
83
+ inference:
84
+ backend: local
85
+ normalize: False
86
+
87
+ weight_dtype: bfloat16
88
+ amp: False
89
+
90
+ optimizations:
91
+ injects: False
92
+ replace: True
93
+
94
+ linear: False
95
+ embedding: False
96
+ optimizers: True
97
+
98
+ bitsandbytes: False
99
+ dadaptation: False
100
+ bitnet: False
101
+ fp8: False
102
+
103
+ dataset:
104
+ speaker_name_getter: "lambda p: f'{p.parts[-3]}_{p.parts[-2]}'"
105
+ speaker_group_getter: "lambda p: f'{p.parts[-3]}'"
106
+
107
+ use_hdf5: True
108
+ hdf5_flag: r
109
+
110
+ use_metadata: True
111
+ validate: True
112
+
113
+ workers: 1
114
+ cache: False
115
+
116
+ duration_range: [3.0, 24.0]
117
+
118
+ random_utterance: 1.0
119
+ max_prompts: 1
120
+ prompt_duration_range: [3.0, 3.0]
121
+
122
+ max_resps: 1
123
+ p_resp_append: 0.25
124
+
125
+ sample_type: path # path # speaker
126
+ sample_order: duration
127
+ sample_max_duration_batch: 100
128
+
129
+ tasks_list: [ "tts" ] #, "tts-c", "ns", "sr" ]
130
+
131
+ training: []
132
+ validation: []
133
+ noise: []
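The `speaker_name_getter` and `speaker_group_getter` entries in the config above are Python lambdas stored as strings. A minimal sketch of what they return for a path is shown below; the example path is hypothetical, and how the trainer actually turns these strings into callables is not shown here.

```python
from pathlib import PurePosixPath

# The two lambdas exactly as written in the config's `dataset` section.
speaker_name_getter = lambda p: f'{p.parts[-3]}_{p.parts[-2]}'
speaker_group_getter = lambda p: f'{p.parts[-3]}'

# Hypothetical utterance path laid out as <root>/<group>/<speaker>/<file>.
utterance_path = PurePosixPath("data/LibriTTS/1234/utterance_0001.enc")

speaker_name = speaker_name_getter(utterance_path)    # group and speaker joined by "_"
speaker_group = speaker_group_getter(utterance_path)  # group directory only
```

In other words, the speaker name is built from the two directories directly above the file, and the group is the outer of those two.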
{model → models/experiments}/config.dac.yaml RENAMED
File without changes
{model → models/experiments}/config.llama-split.yaml RENAMED
File without changes
old/.cache.tar.gz DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:993b57269f1d012c9f5b46372a1f2320e75872da177f7b2071a38578a2ca166b
3
- size 16020894
 
 
 
 
old/.gitattributes DELETED
@@ -1,35 +0,0 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text