Some additional configurations have been explored, but experiments have not been fruitful:
* the 44KHz DAC-based model:
  + the model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio has to be requantized, as there's some stuttering in it (see the requantization sketch after this list).
  + Because of this, training losses are high and the model has a hard time converging.
  + It has *sub-serviceable* output for the first 4 RVQ levels, but it's massive cope to try and use it as a model.
  + ~~I believe there's hope to use it when I requantize my audio properly.~~
  + Addendum: even after properly processing my audio, the loss is actually *worse* than before. I imagine DAC just cannot be used as an intermediary for an LM.
* a model with a causal size >1 (sampling more than one token for the AR):
  + re-using an existing model or training from scratch does not yield fruitful results.
  + there's an inherent periodic stutter that doesn't seem able to be trained out; fixing it might require exotic sampling methods.
  + unfortunately, it requires either:
    - something similar to Medusa heads, where additional parameters perform speculative sampling (a toy sketch follows this list), or
    - a solution similar to VALL-E 2's grouped token embeddings (also sketched below), which *will* harm the NAR tasks in an AR+NAR model.
  + I just don't understand where the issue lies, since parallel decoding does work, as evidenced by the NAR.
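
For reference, a minimal sketch of requantizing audio at the true 44.1KHz rate before encoding with DAC. The encode calls follow the `descript-audio-codec` package's documented usage; the file path and the mono mixdown are my own assumptions:

```python
# Requantization sketch: resample to 44100 Hz (not an even 44000 Hz), then encode.
# Assumes `pip install descript-audio-codec torchaudio`; "utterance.wav" is a placeholder.
import torch
import torchaudio
import dac

model = dac.DAC.load(dac.utils.download(model_type="44khz"))
model.eval()

wav, sr = torchaudio.load("utterance.wav")  # [channels, samples]
wav = wav.mean(dim=0, keepdim=True)         # mixdown to mono: [1, samples]
# The crux of the bug above: the "44KHz" model actually expects 44100 Hz.
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=44100)

with torch.no_grad():
    x = model.preprocess(wav.unsqueeze(0), 44100)  # [1, 1, samples], padded to the hop size
    _, codes, *_ = model.encode(x)                 # codes: [1, n_codebooks, n_frames]
```
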
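To illustrate the Medusa-heads option: extra output heads each draft one additional future token from the same hidden state, and the base model verifies the drafts. A toy sketch; all names and shapes here are illustrative, not this repo's actual code:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Draft heads for speculative sampling: head i drafts the token at offset
    t+2+i (the base LM head already covers t+1)."""
    def __init__(self, d_model: int, n_tokens: int, causal_size: int):
        super().__init__()
        # causal_size - 1 extra heads on top of the base model's own head
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_model),
                nn.SiLU(),
                nn.Linear(d_model, n_tokens),
            )
            for _ in range(causal_size - 1)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, d_model] from the final decoder layer
        # returns draft logits: [batch, seq, causal_size - 1, n_tokens]
        return torch.stack([head(hidden) for head in self.heads], dim=2)
```

At inference, the drafted tokens get re-scored by the base model in a single forward pass and accepted or rejected, which is where the extra parameters and compute go.
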
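And the VALL-E 2-style alternative: fold `group` consecutive codec tokens into one sequence position on the way in, and predict all of them jointly on the way out. Again a toy sketch under my own naming, not VALL-E 2's actual implementation:

```python
import torch
import torch.nn as nn

class GroupedCodes(nn.Module):
    """Grouped code modeling: `group` tokens share one position in the AR sequence."""
    def __init__(self, n_tokens: int, d_model: int, group: int):
        super().__init__()
        self.group = group
        self.embed = nn.Embedding(n_tokens, d_model)
        self.fold = nn.Linear(group * d_model, d_model)     # group embeddings -> 1 position
        self.unfold = nn.Linear(d_model, group * n_tokens)  # 1 position -> group logit sets

    def embed_codes(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: [batch, seq], with seq divisible by `group`
        b, t = codes.shape
        e = self.embed(codes).reshape(b, t // self.group, -1)
        return self.fold(e)  # [batch, seq // group, d_model]

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq // group, d_model]
        b, s, _ = hidden.shape
        return self.unfold(hidden).reshape(b, s, self.group, -1)  # [b, s, group, n_tokens]
```

The sequence the AR sees becomes `group` times shorter, which is presumably where the harm to the NAR tasks comes from in a shared AR+NAR model.
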
Some current "architectural features" are in use, but their effects need to be experimented with further:
* `split_classifier_heads`: whether it's truly helpful is still a mystery (each RVQ level gets its own output head; see the sketch below).
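
For clarity, a minimal sketch of what `split_classifier_heads` describes (illustrative names and shapes, not the exact implementation):

```python
import torch
import torch.nn as nn

class SplitClassifierHeads(nn.Module):
    """One output projection per RVQ level instead of a single shared classifier."""
    def __init__(self, d_model: int, n_tokens: int, n_levels: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, n_tokens) for _ in range(n_levels))

    def forward(self, hidden: torch.Tensor, level: int) -> torch.Tensor:
        # hidden: [batch, seq, d_model]; returns logits for the RVQ level being decoded
        return self.heads[level](hidden)
```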