Update README.md
README.md
CHANGED
```diff
@@ -26,9 +26,11 @@ To reiterate, this is ***by no means*** complete. I am not passing this off as c
 + The current RVQ level is included as a token as well to help guide NAR tasks better.
 + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and better make the model inference for longer utterances.
 + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
-+ However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations
--  - I believe the "slowly stepping up the context length" only works for text, and not audio
++ ~~However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.~~
++  - ~~I believe the "slowly stepping up the context length" only works for text, and not audio.~~
++  - Addendum: Additional brief training for a variety of duration lengths seemed to have mostly fixed this issue.
 + Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
++  - Addendum: Additional brief training for sampling based on speaker per "epoch" (per dataloader, not dataset) seemed to slightly improve it.
 + Testing showed that, despite also stepping up the prompt duration, it *really* likes three second prompts.
 + Definitely needs additional training.

```
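On the RVQ-level token note above: a minimal sketch of what that conditioning might look like, assuming a learned per-level embedding prepended to the NAR input so the model knows which residual codebook it is predicting. All names here (`QUANT_LEVELS`, `level_emb`, `condition_on_level`) are illustrative, not the repo's actual API.

```python
import torch
import torch.nn as nn

QUANT_LEVELS = 8   # assumption: an EnCodec-style codec with 8 residual codebooks
d_model = 1024     # illustrative model width

# One learned embedding per RVQ level.
level_emb = nn.Embedding(QUANT_LEVELS, d_model)

def condition_on_level(x: torch.Tensor, level: int) -> torch.Tensor:
    """Prepend the RVQ-level token to a (batch, seq, d_model) input."""
    batch = x.shape[0]
    tok = level_emb(torch.full((batch, 1), level, dtype=torch.long))
    return torch.cat([tok, x], dim=1)

# usage: condition_on_level(torch.randn(2, 100, d_model), level=3).shape
#        -> (2, 101, d_model)
```

Prepending a dedicated token (rather than summing the embedding into every position) keeps the level signal as one unambiguous slot the attention layers can read from.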
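On "stepping up the duration window": a rough sketch of the mechanism, assuming a step-indexed schedule of admissible utterance durations used as a dataloader-side filter. The breakpoints and window sizes are made up for illustration; the actual schedule is not stated in the notes.

```python
# Hypothetical duration-curriculum filter; only the mechanism matters:
# the max admissible utterance duration is raised as training progresses.
def duration_window(step: int) -> tuple[float, float]:
    """(min, max) utterance duration in seconds admitted at a given step."""
    schedule = [
        (0,      (1.0,  4.0)),
        (10_000, (1.0,  8.0)),
        (20_000, (1.0, 12.0)),
    ]
    lo, hi = schedule[0][1]
    for start, window in schedule:
        if step >= start:
            lo, hi = window
    return lo, hi

def admit(duration_s: float, step: int) -> bool:
    """Keep only utterances that fit the current window."""
    lo, hi = duration_window(step)
    return lo <= duration_s <= hi
```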
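On the shuffling-between-speakers point: a sketch of speaker-balanced sampling, drawing a speaker uniformly and then one of its utterances, instead of drawing uniformly from the global pool (which over-represents speakers with more data). The helper and its inputs are hypothetical.

```python
import random

def speaker_balanced_indices(utts_by_speaker: dict[str, list[int]],
                             n: int) -> list[int]:
    """Draw a speaker uniformly, then one of its utterances, n times."""
    speakers = list(utts_by_speaker)
    return [random.choice(utts_by_speaker[random.choice(speakers)])
            for _ in range(n)]

# usage: speaker_balanced_indices({"spk_a": [0, 1, 2], "spk_b": [3]}, n=4)
```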
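On the three-second prompts: assuming EnCodec's 24 kHz model, which emits 75 code frames per second, a 3-second prompt is 225 frames. A sketch of trimming a prompt to that length; the function name and the `(n_quant, n_frames)` layout are assumptions, not the repo's code.

```python
import random
import torch

FRAMES_PER_SEC = 75  # EnCodec 24 kHz: 75 code frames per second

def trim_prompt(codes: torch.Tensor, seconds: float = 3.0) -> torch.Tensor:
    """Take a random ~`seconds`-long slice of a (n_quant, n_frames) code tensor."""
    n = int(seconds * FRAMES_PER_SEC)  # 3.0 s -> 225 frames
    if codes.shape[-1] <= n:
        return codes
    start = random.randint(0, codes.shape[-1] - n)
    return codes[..., start:start + n]
```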