Update README.md
README.md (changed)
@@ -70,12 +70,6 @@ Under `./models/experiments/` are some failed models, but are included to serve
   + Delve into other, exotic features, such as utilizing DAC's decoding embeddings (which might not be necessary at all, since it seems *fine* at the moment).
   + Addendum: This seems unnecessary, as freezing to these embeddings is harmful, and not freezing them will just inevitably cause them to shift elsewhere.
 
-* `config.dac-nar-len.yaml` / `nar-len-llama-9`: A DAC-based model, but a pure NAR model (+ autoregressive length task).
-  + Originally thought to be bunk from inferencing tests having the audio drastically drop off into silence, but I suppose it was just some issue that eventually resolved itself.
-  + Addendum: I don't know what magic I did for that model, but I cannot recreate a decent EnCodec-backed model instead, despite the test trainer working fine.
-  + Suffers from the same problems the above model suffers from (terrible quality).
-  + *Huge* performance gains, but it may well suffer from some specific qualities in the outputs, if it does get trained right.
-
 * `config.llama-x4.yaml` / `ar+nar-llama-8`: The above `ar+nar-llama-8` model, but with para-parallel decoding for the AR in-post.
   + This mostly serves as a proof-of-concept for speeding up inferencing by reducing the number of steps required, decoding multiple tokens in parallel in a manner similar to how the NAR decodes a whole level in parallel.
   + Trained with the trainer's batch-by-durations sampler at a maximum batch duration of 100 seconds (750 resp tokens), with ProdigyOpt at bfloat16 (no AMP) on my 4070Ti (because I can't be assed to fire up my 4xV100 machine again for a simple test).
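As a rough illustration of the para-parallel decoding idea in the hunk above (sampling several AR tokens from a single forward pass, the way the NAR fills a whole level in parallel), here is a minimal sketch; `model`, the mask/stop token ids, and the greedy sampling are placeholder assumptions for illustration, not the repo's actual implementation:

```python
import torch

@torch.inference_mode()
def parallel_ar_decode(model, prompt: torch.Tensor, k: int = 4, max_len: int = 750,
                       mask_token: int = 1024, stop_token: int = 1025) -> torch.Tensor:
    """Decode RVQ level 0 `k` tokens at a time instead of one per step (sketch)."""
    seq = prompt  # (1, t) long tensor of level-0 token ids
    while seq.shape[1] < max_len:
        # reserve k placeholder positions so one forward pass yields k predictions
        placeholders = torch.full((1, k), mask_token, dtype=seq.dtype, device=seq.device)
        logits = model(torch.cat([seq, placeholders], dim=1))  # (1, t + k, vocab)
        new_tokens = logits[:, -k:, :].argmax(dim=-1)          # greedy, for brevity
        seq = torch.cat([seq, new_tokens], dim=1)
        if (new_tokens == stop_token).any():                   # stop token emitted
            break
    return seq
```

The trade-off is the usual one for multi-token decoding: fewer forward passes per utterance, at the risk that the later tokens in each group are conditioned on guesses rather than committed context.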
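Likewise, the batch-by-durations sampler mentioned above can be pictured as packing utterances into a batch until a total-duration budget (here 100 seconds) would be exceeded; this is only a hedged sketch with made-up names, not the trainer's actual sampler:

```python
import random

def batches_by_duration(durations: list[float], max_duration: float = 100.0,
                        shuffle: bool = True):
    """Yield lists of dataset indices whose summed duration stays within budget."""
    order = list(range(len(durations)))
    if shuffle:
        random.shuffle(order)
    batch, total = [], 0.0
    for idx in order:
        # close the current batch before it would exceed the duration budget
        if batch and total + durations[idx] > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(idx)
        total += durations[idx]
    if batch:
        yield batch
```

Batching by duration rather than by sample count keeps the amount of audio (and thus VRAM use) roughly constant from batch to batch, which is why the knob being tuned is a duration figure rather than a batch size.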
@@ -87,6 +81,10 @@ Under `./models/experiments/` are some failed models, but are included to serve
 Some additional configurations have been explored, but the experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
 * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing the compute at another ~~meme~~ arch that I can't easily throw all the other tech at.
+* A pure NAR (plus length predictor) cannot be realized with the current architecture.
+  + Transformer-based (or at least attention-based) models can't seem to handle generating the initial (RVQ level 0) tokens from "thin air" (be it from special tokens or from repeating the input prompt).
+  + A diffusion-based model will definitely work, as those are good at generating from noise.
+  + The performance gains would be nice, as the biggest "bottleneck" is the initial (RVQ level 0) AR pass, but it seems to require a lot of effort.
 
 Some current "architectural features" are in use, but their effects need to be experimented with further:
 * `split_classifier_heads` is still a mystery as to whether it's truly helpful or not (each RVQ level gets its own output head).
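To make the "pure NAR (plus length predictor)" bullet concrete, the inference flow it describes would look roughly like the sketch below; `length_model`, `nar_model`, and the mask token are stand-ins for illustration, and (per the notes above) the level-0 fill step is exactly where attention-based models fall over:

```python
import torch

@torch.inference_mode()
def pure_nar_infer(length_model, nar_model, text: torch.Tensor, prompt: torch.Tensor,
                   mask_token: int = 1024) -> torch.Tensor:
    """Sketch of length prediction followed by a one-shot NAR fill of RVQ level 0."""
    # 1) length task: predict how many response frames to emit
    n_frames = int(length_model(text, prompt))
    # 2) start level 0 from "thin air": every position is just a mask token
    level0 = torch.full((1, n_frames), mask_token, dtype=torch.long)
    # 3) one parallel pass to fill level 0 -- the step reported not to work
    logits = nar_model(text, prompt, level0, level=0)  # (1, n_frames, vocab)
    return logits.argmax(dim=-1)                       # remaining levels follow as usual
```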
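And for `split_classifier_heads`, the feature amounts to giving each RVQ level its own output projection instead of one shared classifier; a minimal sketch, with placeholder dimensions rather than the model's real config:

```python
import torch
from torch import nn

class SplitClassifierHeads(nn.Module):
    """One output head per RVQ level instead of a single shared classifier (sketch)."""
    def __init__(self, d_model: int = 1024, n_tokens: int = 1024, n_levels: int = 8):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, n_tokens) for _ in range(n_levels)])

    def forward(self, hidden: torch.Tensor, level: int) -> torch.Tensor:
        # route hidden states through the head for the RVQ level currently being decoded
        return self.heads[level](hidden)
```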