Some additional configurations have been explored, but experiments have not been fruitful:
* the 44KHz DAC-based model:
  + the model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio has to be requantized, as there's some stuttering in it (see the requantization sketch after this list).
  + Because of this, training losses are high and the model has a hard time converging.
  + It has *sub-serviceable* output for the first 4 RVQ levels, but it's massive cope to try and use it as a model.
  + ~~I believe there's hope to use it when I requantize my audio properly.~~
  + Addendum: even after properly processing my audio, the loss is actually *worse* than before. I imagine DAC just cannot be used as an intermediary for an LM.
* a model with a causal size >1 (sampling more than one token for the AR):
  + re-using an existing model or training from scratch does not yield fruitful results.
  + there's an inherent periodic stutter that doesn't seem able to be trained out; fixing it might require exotic sampling methods.
  + unfortunately, it requires either:
    - something similar to Medusa heads, where additional parameters perform speculative sampling (a toy sketch follows this list), or
    - a solution similar to VALL-E 2's grouped token embeddings (also sketched below), which *will* harm the NAR tasks in an AR+NAR model.
  + I just don't understand where the issue lies, since parallel decoding does work, as evidenced by the NAR.
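
For reference, a minimal sketch of requantizing audio at the true 44.1KHz rate before encoding with DAC. The encode calls follow the `descript-audio-codec` package's documented usage; the file path and the mono mixdown are my own assumptions:

```python
# Requantization sketch: resample to 44100 Hz (not an even 44000 Hz), then encode.
# Assumes `pip install descript-audio-codec torchaudio`; "utterance.wav" is a placeholder.
import torch
import torchaudio
import dac

model = dac.DAC.load(dac.utils.download(model_type="44khz"))
model.eval()

wav, sr = torchaudio.load("utterance.wav")  # [channels, samples]
wav = wav.mean(dim=0, keepdim=True)         # mixdown to mono: [1, samples]
# The crux of the bug above: the "44KHz" model actually expects 44100 Hz.
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=44100)

with torch.no_grad():
    x = model.preprocess(wav.unsqueeze(0), 44100)  # [1, 1, samples], padded to the hop size
    _, codes, *_ = model.encode(x)                 # codes: [1, n_codebooks, n_frames]
```
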
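To illustrate the Medusa-heads option: extra output heads each draft one additional future token from the same hidden state, and the base model verifies the drafts. A toy sketch; all names and shapes here are illustrative, not this repo's actual code:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Draft heads for speculative sampling: head i drafts the token at offset
    t+2+i (the base LM head already covers t+1)."""
    def __init__(self, d_model: int, n_tokens: int, causal_size: int):
        super().__init__()
        # causal_size - 1 extra heads on top of the base model's own head
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_model),
                nn.SiLU(),
                nn.Linear(d_model, n_tokens),
            )
            for _ in range(causal_size - 1)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, d_model] from the final decoder layer
        # returns draft logits: [batch, seq, causal_size - 1, n_tokens]
        return torch.stack([head(hidden) for head in self.heads], dim=2)
```

At inference, the drafted tokens get re-scored by the base model in a single forward pass and accepted or rejected, which is where the extra parameters and compute go.
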
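And the VALL-E 2-style alternative: fold `group` consecutive codec tokens into one sequence position on the way in, and predict all of them jointly on the way out. Again a toy sketch under my own naming, not VALL-E 2's actual implementation:

```python
import torch
import torch.nn as nn

class GroupedCodes(nn.Module):
    """Grouped code modeling: `group` tokens share one position in the AR sequence."""
    def __init__(self, n_tokens: int, d_model: int, group: int):
        super().__init__()
        self.group = group
        self.embed = nn.Embedding(n_tokens, d_model)
        self.fold = nn.Linear(group * d_model, d_model)     # group embeddings -> 1 position
        self.unfold = nn.Linear(d_model, group * n_tokens)  # 1 position -> group logit sets

    def embed_codes(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: [batch, seq], with seq divisible by `group`
        b, t = codes.shape
        e = self.embed(codes).reshape(b, t // self.group, -1)
        return self.fold(e)  # [batch, seq // group, d_model]

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq // group, d_model]
        b, s, _ = hidden.shape
        return self.unfold(hidden).reshape(b, s, self.group, -1)  # [b, s, group, n_tokens]
```

The sequence the AR sees becomes `group` times shorter, which is presumably where the harm to the NAR tasks comes from in a shared AR+NAR model.
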
Some current "architectural features" are in use, but their effects need to be experimented with further:
* `split_classifier_heads`: whether it's truly helpful is still a mystery (each RVQ level gets its own output head; see the sketch below).
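
For clarity, a minimal sketch of what `split_classifier_heads` describes (illustrative names and shapes, not the exact implementation):

```python
import torch
import torch.nn as nn

class SplitClassifierHeads(nn.Module):
    """One output projection per RVQ level instead of a single shared classifier."""
    def __init__(self, d_model: int, n_tokens: int, n_levels: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, n_tokens) for _ in range(n_levels))

    def forward(self, hidden: torch.Tensor, level: int) -> torch.Tensor:
        # hidden: [batch, seq, d_model]; returns logits for the RVQ level being decoded
        return self.heads[level](hidden)
```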