ecker committed on
Commit e1f07f0
1 Parent(s): 4c2ad91

Update README.md

Files changed (1):
  1. README.md +4 -6
README.md CHANGED
@@ -70,12 +70,6 @@ Under `./models/experiments/` are some failed models, but are included to serve
  + Delve into other, exotic features, such as utilizing DAC's decoding embeddings (which might not be necessary at all since it seems *fine* at the moment).
  + Addendum: This seems unnecessary, as freezing these embeddings is harmful, and not freezing them will just inevitably cause them to shift elsewhere.
 
 - * `config.dac-nar-len.yaml` / `nar-len-llama-9`: A DAC-based model, but a pure NAR model (plus an autoregressive length task).
 - + Originally thought to be bunk from inferencing tests where the audio drastically drops off into silence, but I suppose it was just some issue that eventually resolved itself.
 - + Addendum: I don't know what magic I did for that model, but I cannot recreate a decent EnCodec-backed model instead, despite the test trainer working fine.
 - + Suffers from the same problems as the above model (terrible quality).
 - + *Huge* performance gains, but it may well suffer from some specific quirks in the outputs, if it does get trained right.
 -
  * `config.llama-x4.yaml` / `ar+nar-llama-8`: The above `ar+nar-llama-8` model, but with para-parallel decoding for the AR in-post.
  + This mostly serves as a proof-of-concept for speeding up inferencing by reducing the number of steps required, by decoding multiple tokens in parallel with a similar approach to how the NAR decodes in parallel.
  + Trained with the trainer's batch-by-durations sampler for a maximum duration batch size of 100 seconds (750 resp tokens), with ProdigyOpt at bfloat16 (no AMP) on my 4070Ti (because I can't be assed to fire up my 4xV100 machine again for a simple test).
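
The "para-parallel decoding" above boils down to committing several AR tokens per forward pass instead of one, in the same spirit as the NAR filling a whole RVQ level in parallel. Below is a minimal sketch of that loop, assuming a model trained to fill `k` trailing placeholder slots per pass; `model`, `mask_id`, and `stop_id` are hypothetical stand-ins, not this repo's actual API:

```python
import torch

# Hypothetical sketch of "para-parallel" AR decoding: append k placeholder
# slots, run one forward pass, and commit k tokens at once (NAR-style
# parallel prediction inside the AR loop).
@torch.inference_mode()
def parallel_ar_decode(model, prompt_ids, k: int = 4, max_len: int = 750,
                       mask_id: int = 1024, stop_id: int = 1025):
    seq = prompt_ids  # (1, T) token ids for the conditioning context
    while seq.shape[1] < max_len:
        placeholders = torch.full((1, k), mask_id, dtype=seq.dtype, device=seq.device)
        logits = model(torch.cat([seq, placeholders], dim=1))  # (1, T + k, vocab)
        new_tokens = logits[:, -k:].argmax(dim=-1)             # greedy, for brevity
        seq = torch.cat([seq, new_tokens], dim=1)
        if (new_tokens == stop_id).any():                      # stop token emitted
            break
    return seq
```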
@@ -87,6 +81,10 @@ Under `./models/experiments/` are some failed models, but are included to serve
  Some additional configurations have been explored, but the experiments have not been fruitful:
  * Exotic wrappers like `BitNet` seemed to yield little gains in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
  * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily make use of all the other tech I'd like to throw at it.
+ * A pure NAR (plus length predictor) cannot be realized with the current architecture.
+ + Transformer-based (or at least attention-based) models can't seem to handle generating the initial (RVQ level 0) tokens from "thin air" (be it from special tokens or from repeating the input prompt).
+ + A diffusion-based model will definitely work, as those are good at generating from noise.
+ + The performance gains seem nice, as the biggest "bottleneck" is the initial (RVQ level 0) AR pass, but it seems to require a lot of effort.
 
  Some current "architectural features" are in-use, but their effects need to be experimented with further:
  * `split_classifier_heads` (each RVQ level gets its own output head): it's still a mystery whether it's truly helpful or not.
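
For reference, `split_classifier_heads` amounts to swapping the single shared output projection for one projection per RVQ level. A minimal, hypothetical sketch of that shape (names assumed, not the repo's actual module):

```python
import torch
from torch import nn

# Minimal sketch of split classifier heads: one output projection per RVQ
# level instead of a single shared classifier head. Dimensions are assumed.
class SplitClassifierHeads(nn.Module):
    def __init__(self, d_model: int = 1024, n_audio_tokens: int = 1024, n_rvq_levels: int = 8):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, n_audio_tokens) for _ in range(n_rvq_levels)]
        )

    def forward(self, hidden: torch.Tensor, rvq_level: int) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model); route through the head that
        # corresponds to the RVQ level currently being decoded
        return self.heads[rvq_level](hidden)
```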
 
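To make the added note about a pure NAR concrete: the idea would be to predict a frame count up front, start RVQ level 0 as all mask tokens, and iteratively demask, which is exactly the "from thin air" step that attention-based models seem to choke on here. A rough MaskGIT-style sketch of that inference loop; `model`, `length_head`, and `mask_id` are assumed stand-ins, nothing from this repo:

```python
import torch

# Rough sketch of "pure NAR + length predictor" inference: predict a frame
# count, start RVQ level 0 fully masked, and commit the most confident
# predictions over a few demasking steps. All names here are placeholders.
@torch.inference_mode()
def pure_nar_level0(model, length_head, text_ids, steps: int = 8, mask_id: int = 1024):
    n_frames = int(length_head(text_ids))              # predicted output length
    tokens = torch.full((1, n_frames), mask_id)        # level 0 starts from "thin air"
    for step in range(steps):
        logits = model(text_ids, tokens)                # (1, n_frames, vocab)
        probs, preds = logits.softmax(dim=-1).max(dim=-1)
        still_masked = tokens == mask_id
        # commit a growing share of the most confident, still-masked positions
        n_commit = max(1, int(still_masked.sum().item() * (step + 1) / steps))
        confidence = probs.masked_fill(~still_masked, -1.0)
        idx = confidence.topk(n_commit, dim=-1).indices[0]
        tokens[0, idx] = preds[0, idx]
    return tokens
```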