Update README.md
README.md (changed)
@@ -70,12 +70,6 @@ Under `./models/experiments/` are some failed models, but are included to serve
   + Delve into other, exotic features, such as utilizing DAC's decoding embeddings (which might not be necessary at all, since it seems *fine* at the moment).
   + Addendum: This seems unnecessary, as freezing to these embeddings is harmful, and not freezing them will just inevitably cause them to shift elsewhere.
 
-* `config.dac-nar-len.yaml` / `nar-len-llama-9`: A DAC-based model, but a pure NAR model (+ autoregressive length task).
-  + Originally thought to be bunk from inferencing tests having the audio drastically drop off into silence, but I suppose it was just some issue that eventually resolved itself.
-  + Addendum: I don't know what magic I did for that model, but I cannot recreate a decent EnCodec-backed model instead, despite the test trainer working fine.
-  + Suffers from the same problems the above model suffers from (terrible quality).
-  + *Huge* performance gains, but it may well suffer from some specific qualities in the outputs, if it does get trained right.
-
 * `config.llama-x4.yaml` / `ar+nar-llama-8`: The above `ar+nar-llama-8` model, but with para-parallel decoding for the AR in-post.
   + This mostly serves as a proof-of-concept for speeding up inferencing by reducing the number of steps required, decoding multiple tokens in parallel in a manner similar to how the NAR decodes a whole level in parallel.
   + Trained with the trainer's batch-by-durations sampler at a maximum batch duration of 100 seconds (750 resp tokens), with ProdigyOpt at bfloat16 (no AMP) on my 4070Ti (because I can't be assed to fire up my 4xV100 machine again for a simple test).
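As a rough illustration of the para-parallel decoding idea in the hunk above (sampling several AR tokens from a single forward pass, the way the NAR fills a whole level in parallel), here is a minimal sketch; `model`, the mask/stop token ids, and the greedy sampling are placeholder assumptions for illustration, not the repo's actual implementation:

```python
import torch

@torch.inference_mode()
def parallel_ar_decode(model, prompt: torch.Tensor, k: int = 4, max_len: int = 750,
                       mask_token: int = 1024, stop_token: int = 1025) -> torch.Tensor:
    """Decode RVQ level 0 `k` tokens at a time instead of one per step (sketch)."""
    seq = prompt  # (1, t) long tensor of level-0 token ids
    while seq.shape[1] < max_len:
        # reserve k placeholder positions so one forward pass yields k predictions
        placeholders = torch.full((1, k), mask_token, dtype=seq.dtype, device=seq.device)
        logits = model(torch.cat([seq, placeholders], dim=1))  # (1, t + k, vocab)
        new_tokens = logits[:, -k:, :].argmax(dim=-1)          # greedy, for brevity
        seq = torch.cat([seq, new_tokens], dim=1)
        if (new_tokens == stop_token).any():                   # stop token emitted
            break
    return seq
```

The trade-off is the usual one for multi-token decoding: fewer forward passes per utterance, at the risk that the later tokens in each group are conditioned on guesses rather than committed context.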
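Likewise, the batch-by-durations sampler mentioned above can be pictured as packing utterances into a batch until a total-duration budget (here 100 seconds) would be exceeded; this is only a hedged sketch with made-up names, not the trainer's actual sampler:

```python
import random

def batches_by_duration(durations: list[float], max_duration: float = 100.0,
                        shuffle: bool = True):
    """Yield lists of dataset indices whose summed duration stays within budget."""
    order = list(range(len(durations)))
    if shuffle:
        random.shuffle(order)
    batch, total = [], 0.0
    for idx in order:
        # close the current batch before it would exceed the duration budget
        if batch and total + durations[idx] > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(idx)
        total += durations[idx]
    if batch:
        yield batch
```

Batching by duration rather than by sample count keeps the amount of audio (and thus VRAM use) roughly constant from batch to batch, which is why the knob being tuned is a duration figure rather than a batch size.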
@@ -87,6 +81,10 @@ Under `./models/experiments/` are some failed models, but are included to serve
 Some additional configurations have been explored, but the experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
 * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing the compute at another ~~meme~~ arch that I can't easily throw all the other tech at.
+* A pure NAR (plus length predictor) cannot be realized with the current architecture.
+  + Transformer-based (or at least attention-based) models can't seem to handle generating the initial (RVQ level 0) tokens from "thin air" (be it from special tokens or from repeating the input prompt).
+  + A diffusion-based model will definitely work, as those are good at generating from noise.
+  + The performance gains would be nice, as the biggest "bottleneck" is the initial (RVQ level 0) AR pass, but it seems to require a lot of effort.
 
 Some current "architectural features" are in use, but their effects need to be experimented with further:
 * `split_classifier_heads` is still a mystery as to whether it's truly helpful or not (each RVQ level gets its own output head).
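To make the "pure NAR (plus length predictor)" bullet concrete, the inference flow it describes would look roughly like the sketch below; `length_model`, `nar_model`, and the mask token are stand-ins for illustration, and (per the notes above) the level-0 fill step is exactly where attention-based models fall over:

```python
import torch

@torch.inference_mode()
def pure_nar_infer(length_model, nar_model, text: torch.Tensor, prompt: torch.Tensor,
                   mask_token: int = 1024) -> torch.Tensor:
    """Sketch of length prediction followed by a one-shot NAR fill of RVQ level 0."""
    # 1) length task: predict how many response frames to emit
    n_frames = int(length_model(text, prompt))
    # 2) start level 0 from "thin air": every position is just a mask token
    level0 = torch.full((1, n_frames), mask_token, dtype=torch.long)
    # 3) one parallel pass to fill level 0 -- the step reported not to work
    logits = nar_model(text, prompt, level0, level=0)  # (1, n_frames, vocab)
    return logits.argmax(dim=-1)                       # remaining levels follow as usual
```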
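And for `split_classifier_heads`, the feature amounts to giving each RVQ level its own output projection instead of one shared classifier; a minimal sketch, with placeholder dimensions rather than the model's real config:

```python
import torch
from torch import nn

class SplitClassifierHeads(nn.Module):
    """One output head per RVQ level instead of a single shared classifier (sketch)."""
    def __init__(self, d_model: int = 1024, n_tokens: int = 1024, n_levels: int = 8):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, n_tokens) for _ in range(n_levels)])

    def forward(self, hidden: torch.Tensor, level: int) -> torch.Tensor:
        # route hidden states through the head for the RVQ level currently being decoded
        return self.heads[level](hidden)
```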