Multi-models of Space (Kokoro)

#16
by ecyht2 - opened

There is a new version of StyleTTS Kokoro shown in this post. Maybe there should be a separate model entry for it?

In essence you are asking for multi-model support. And yes, it is something I'd like to have. It is the only reason Parler Large is not an available model, as it lives in the same TTS Space.

Whereas a strict per-model leaderboard would not get reliable data; there would be too few data points for each separate fine-tune... 😕

[edit] And the reason kokoro isn't currently working is because the API endpoints changed. I'll fix it when this Space decides to do a soft reset of the Gradio client.

Oops, thought I was keeping it backwards compatible, but clearly not. I have restored API parity with some temporary hacks on my end.

For now, Kokoro should be working in the Arena again (tested).

https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena/discussions/17 should future-proof against API-breaking changes. Once that PR lands, I can remove the temporary hacks.
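For context, here is a minimal sketch of the kind of gradio_client call that breaks when a Space renames its API endpoints. The Space ID and api_name below are illustrative assumptions, not the Arena's actual configuration:

```python
# Minimal sketch (assumed Space ID and endpoint name, not the Arena's real code):
# querying a TTS Space through gradio_client. If the Space renames or
# re-parameterizes its endpoint, the predict() call below starts failing
# until the client-side call is updated.
from gradio_client import Client

client = Client("hexgrad/Kokoro-TTS")  # assumed Space ID

# Print the endpoints the Space currently exposes and their parameters.
client.view_api()

# Call a named endpoint; an endpoint rename shows up as an error here.
result = client.predict(
    "Hello from the Arena.",  # text to synthesize
    api_name="/generate",     # assumed endpoint name
)
print(result)
```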

Pendrokar changed discussion title from New Version of StyleTTS Kokoro to Multi-models of Space (Kokoro)

@Pendrokar Kokoro v0.19 remains the stable version for this Arena, even with v0.22 dropping. I think the English differences should be minor, and there's a chance v0.22 might be superseded soon™ if/when I crack Hindi or stumble across more training data. IMO, it doesn't make sense to bump Kokoro v0.19 since it's already in the 🥇 spot by a decent margin. As the saying goes, "If it ain't broke, don't fix it."

Aside from going multilingual, v0.22 includes slightly better tokenization for hard English text (not always perfect, but better):

After I read that you can read, I can associate myself with your associates too.
ˈæftɚɹ aɪ ɹˈɛd ðæt juː kæn ɹˈiːd, aɪ kæn ɐsˈoʊsɪˌeɪt maɪsˈɛlf wɪð jʊɹ ɐsˈoʊsiəts tˈuː.

On the other hand, v0.19 returns the following:

After I read that you can read, I can associate myself with your associates too.
ˈæftɚɹ aɪ ɹˈiːd ðæt juː kæn ɹˈiːd, aɪ kæn ɐsˈoʊsɪˌeɪt maɪsˈɛlf wɪð jʊɹ ɐsˈoʊsɪˌeɪts tˈuː.

Even though the second "associates" still sounds fine in v0.19, note that the input phonemes are wrong, which means the model has learned to compensate for the g2p error. That's arguably fine, but (1) relying on the model to make up for g2p errors isn't robust across all texts, and (2) it's wasted neurons that could be going toward learning more useful patterns. When you only have 82M params and you want to stuff a bunch of languages into the model, I think it's worth aggressively prosecuting these g2p errors.
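For anyone wanting to reproduce a comparison like the one above, here is a minimal sketch using the phonemizer package with the espeak-ng backend; that backend choice is an assumption for illustration and may not match the exact g2p pipeline of either Kokoro version:

```python
# Minimal sketch (assumes the `phonemizer` package and an installed espeak-ng
# backend; not necessarily Kokoro's own g2p pipeline). It phonemizes the test
# sentence so the output can be compared against the IPA strings above.
from phonemizer import phonemize

text = "After I read that you can read, I can associate myself with your associates too."

ipa = phonemize(
    text,
    language="en-us",
    backend="espeak",
    with_stress=True,           # keep primary/secondary stress marks
    preserve_punctuation=True,  # keep the comma and period in the output
)
print(ipa)
```

Heteronyms like "read" (past vs. present tense) and "associate" (verb vs. noun) are exactly where sentence-level context matters, so this sentence is a reasonable spot check for any g2p front end.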
