Update README.md
README.md
CHANGED
```diff
@@ -26,9 +26,11 @@ To reiterate, this is ***by no means*** complete. I am not passing this off as c
 + The current RVQ level is included as a token as well to help guide NAR tasks better.
 + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and better make the model inference for longer utterances.
 + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
-+ However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations
--  - I believe the "slowly stepping up the context length" only works for text, and not audio
++ ~~However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.~~
++  - ~~I believe the "slowly stepping up the context length" only works for text, and not audio.~~
++  - Addendum: Additional brief training for a variety of duration lengths seemed to have mostly fixed this issue.
 + Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
++  - Addendum: Additional brief training for sampling based on speaker per "epoch" (per dataloader, not dataset) seemed to slightly improve it.
 + Testing showed that, despite also stepping up the prompt duration, it *really* likes three second prompts.
 + Definitely needs additional training.

```
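On the RVQ-level token note above: a minimal sketch of what that conditioning might look like, assuming a learned per-level embedding prepended to the NAR input so the model knows which residual codebook it is predicting. All names here (`QUANT_LEVELS`, `level_emb`, `condition_on_level`) are illustrative, not the repo's actual API.

```python
import torch
import torch.nn as nn

QUANT_LEVELS = 8   # assumption: an EnCodec-style codec with 8 residual codebooks
d_model = 1024     # illustrative model width

# One learned embedding per RVQ level.
level_emb = nn.Embedding(QUANT_LEVELS, d_model)

def condition_on_level(x: torch.Tensor, level: int) -> torch.Tensor:
    """Prepend the RVQ-level token to a (batch, seq, d_model) input."""
    batch = x.shape[0]
    tok = level_emb(torch.full((batch, 1), level, dtype=torch.long))
    return torch.cat([tok, x], dim=1)

# usage: condition_on_level(torch.randn(2, 100, d_model), level=3).shape
#        -> (2, 101, d_model)
```

Prepending a dedicated token (rather than summing the embedding into every position) keeps the level signal as one unambiguous slot the attention layers can read from.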
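On "stepping up the duration window": a rough sketch of the mechanism, assuming a step-indexed schedule of admissible utterance durations used as a dataloader-side filter. The breakpoints and window sizes are made up for illustration; the actual schedule is not stated in the notes.

```python
# Hypothetical duration-curriculum filter; only the mechanism matters:
# the max admissible utterance duration is raised as training progresses.
def duration_window(step: int) -> tuple[float, float]:
    """(min, max) utterance duration in seconds admitted at a given step."""
    schedule = [
        (0,      (1.0,  4.0)),
        (10_000, (1.0,  8.0)),
        (20_000, (1.0, 12.0)),
    ]
    lo, hi = schedule[0][1]
    for start, window in schedule:
        if step >= start:
            lo, hi = window
    return lo, hi

def admit(duration_s: float, step: int) -> bool:
    """Keep only utterances that fit the current window."""
    lo, hi = duration_window(step)
    return lo <= duration_s <= hi
```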
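On the shuffling-between-speakers point: a sketch of speaker-balanced sampling, drawing a speaker uniformly and then one of its utterances, instead of drawing uniformly from the global pool (which over-represents speakers with more data). The helper and its inputs are hypothetical.

```python
import random

def speaker_balanced_indices(utts_by_speaker: dict[str, list[int]],
                             n: int) -> list[int]:
    """Draw a speaker uniformly, then one of its utterances, n times."""
    speakers = list(utts_by_speaker)
    return [random.choice(utts_by_speaker[random.choice(speakers)])
            for _ in range(n)]

# usage: speaker_balanced_indices({"spk_a": [0, 1, 2], "spk_b": [3]}, n=4)
```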
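On the three-second prompts: assuming EnCodec's 24 kHz model, which emits 75 code frames per second, a 3-second prompt is 225 frames. A sketch of trimming a prompt to that length; the function name and the `(n_quant, n_frames)` layout are assumptions, not the repo's code.

```python
import random
import torch

FRAMES_PER_SEC = 75  # EnCodec 24 kHz: 75 code frames per second

def trim_prompt(codes: torch.Tensor, seconds: float = 3.0) -> torch.Tensor:
    """Take a random ~`seconds`-long slice of a (n_quant, n_frames) code tensor."""
    n = int(seconds * FRAMES_PER_SEC)  # 3.0 s -> 225 frames
    if codes.shape[-1] <= n:
        return codes
    start = random.randint(0, codes.shape[-1] - n)
    return codes[..., start:start + n]
```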