Update README.md
README.md
@@ -72,6 +72,7 @@ Under `./models/experiments/` are some failed models, but are included to serve

* `config.dac-nar-len.yaml` / `nar-len-llama-9`: A DAC-based model, but a pure NAR model (+ an autoregressive length task); see the sketch after this list for the general idea.
  + Originally thought to be bunk because inferencing tests had the audio drastically drop off into silence, but I suppose it was just some issue that eventually resolved itself.
  + Addendum: I don't know what magic I did for that model, but I cannot recreate a decent EnCodec-backed model instead, despite the test trainer working fine.
  + Suffers from the same problems as the above model (terrible quality).
  + *Huge* performance gains, but it may well suffer from some specific qualities in the outputs, if it does get trained right.
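For context, a minimal sketch of what "autoregressive length task, then pure NAR fill" inference could look like. `predict_length`, `forward_nar`, and `mask_token_id` are hypothetical stand-ins for illustration, not this repo's actual API:

```python
import torch

def nar_len_inference(model, text_ids, prompt_codes, n_levels: int = 9):
    # 1) The only autoregressive piece: a small "len" task predicts how many
    #    audio frames the output should span. (hypothetical helper)
    n_frames = model.predict_length(text_ids, prompt_codes)

    # 2) Pure NAR fill: every frame of a level is predicted in one parallel pass,
    #    conditioning each RVQ level on the levels already filled in.
    codes = torch.full((n_levels, n_frames), model.mask_token_id, dtype=torch.long)
    for level in range(n_levels):
        logits = model.forward_nar(text_ids, prompt_codes, codes, level=level)  # (n_frames, vocab)
        codes[level] = logits.argmax(dim=-1)  # greedy sampling, which the NAR reportedly favors
    return codes
```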
@@ -81,9 +82,13 @@

  + The model definitely needs to be retrained, as there are some errors with the additional tokens.
  + If these cannot be nailed down with more training, then I imagine an approach similar to speculative decoding, where the nth tokens are discarded if their confidence is low (see the sketch below).
  + Greedy sampling might be beneficial for this instead, as the NAR does benefit greatly from low temperatures / greedy sampling.
  + It seems that naively adjusting the "causal size" (the number of tokens to predict into the future, and in turn how many tokens are returned per step) introduces crackles at fixed intervals.
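A minimal sketch of that confidence-gated acceptance idea, in the spirit of speculative decoding; the 0.9 threshold and the `(causal_size, vocab)` logits shape are assumptions for illustration, not anything implemented here:

```python
import torch

def accept_step_tokens(step_logits: torch.Tensor, threshold: float = 0.9) -> list[int]:
    """step_logits: (causal_size, vocab) -- logits for the n tokens predicted this step."""
    probs = step_logits.softmax(dim=-1)
    confidence, tokens = probs.max(dim=-1)

    accepted = [int(tokens[0])]  # always keep the first token of the step
    for tok, conf in zip(tokens[1:].tolist(), confidence[1:].tolist()):
        if conf < threshold:
            break  # drop this token and everything after it; re-predict them next step
        accepted.append(tok)
    return accepted
```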
Some additional configurations have been explored, but experiments with them have not been fruitful:

* Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
* Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily make use of all the other tech I'd want to throw at it.
Some current "architectural features" are in use, but their effects need to be experimented with further:

* `split_classifier_heads` (each RVQ level gets its own output head): still a mystery whether it's truly helpful or not.
* `audio_embeddings_sum`: also a mystery whether it matters if each later RVQ level "sees" the past levels through summed embeddings, or whether not doing so is preferable. A rough sketch of both features follows after this list.
* Disabling `unified_position_ids` seems to help quality more often than not, but I'm still unsure if it's beneficial in practice.
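As a rough illustration of the first two features, under assumed shapes and names (this is not the repo's actual code):

```python
import torch
from torch import nn

N_LEVELS, D_MODEL, CODEBOOK_SIZE = 8, 1024, 1024

# `split_classifier_heads`: one output projection per RVQ level instead of a shared one.
heads = nn.ModuleList([nn.Linear(D_MODEL, CODEBOOK_SIZE) for _ in range(N_LEVELS)])

# `audio_embeddings_sum`: per-level embedding tables whose outputs are summed, so a
# later level "sees" every earlier level's codes.
embeddings = nn.ModuleList([nn.Embedding(CODEBOOK_SIZE, D_MODEL) for _ in range(N_LEVELS)])

def embed_for_level(codes: torch.Tensor, level: int, sum_embeddings: bool = True) -> torch.Tensor:
    """codes: (n_levels, n_frames) -> (n_frames, d_model) input when predicting `level`."""
    if sum_embeddings:
        return torch.stack([embeddings[l](codes[l]) for l in range(level + 1)]).sum(dim=0)
    return embeddings[level](codes[level])

def logits_for_level(hidden: torch.Tensor, level: int) -> torch.Tensor:
    """hidden: (n_frames, d_model) -> (n_frames, codebook_size) via that level's own head."""
    return heads[level](hidden)
```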