Concerns Regarding Noise and Distortion in Vokan TTS Audio Output

by umerahsan - opened Dec 16, 2024

Dec 16, 2024

Vokan TTS generates more expressive audio, with greater variation between voices, subtle changes in delivery, and a noticeably stronger vocal presence. However, the output is noisy and not as crisp as one might expect. Background noise is evident even in the audio samples on the model card. In particular, there is a distorted sound at the end of the audio, which detracts from the overall clarity.

Could you explain why this happens? Is it an issue with the dataset, the fine-tuning process, or possibly a problem with the model's initial pretraining?

Additionally, if I perform fine-tuning, are there ways to avoid or mitigate these issues? Any insights on how to address this would be greatly appreciated.

Korakoe

ShoukanLabs org Dec 16, 2024

edited Dec 16, 2024

Hey 👋

Thank you for trying our model. We are very aware of these issues, and they shouldn't be present in our next model (Vokan V2).

As for why this occurs, it's likely a mix of data issues and the fine-tuning process. First of all, AniSpeech is by no means a clean dataset, and a few noisy samples slipped through the cracks, likely contributing to the noise.

The fine-tuning process was also interrupted, and we had to change GPUs and lower our max_len (how much of the audio is sampled during training) to resolve VRAM constraints. This is likely what leads to the trailing artifacts. The original StyleTTS2 model is also known to produce these for the same reason.
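For a rough sense of what a given max_len covers, here is a minimal sketch. It assumes max_len counts mel frames and that the stock StyleTTS2 preprocessing values apply (24 kHz audio, hop length of 300 samples); neither is confirmed for Vokan in this thread, and the helper names are hypothetical.

```python
# Assumed preprocessing values (stock StyleTTS2 defaults, not confirmed for Vokan).
SAMPLE_RATE = 24_000  # audio samples per second
HOP_LENGTH = 300      # audio samples per mel frame


def max_len_to_seconds(max_len, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Approximate duration (seconds) covered by a window of max_len mel frames."""
    return max_len * hop_length / sample_rate


def fraction_truncated(clip_seconds, max_len):
    """Share of clips that are longer than the training window.

    clip_seconds is a list of clip durations in seconds. Clips longer than
    the window are only partially seen during training.
    """
    if not clip_seconds:
        return 0.0
    window = max_len_to_seconds(max_len)
    longer = sum(1 for seconds in clip_seconds if seconds > window)
    return longer / len(clip_seconds)


if __name__ == "__main__":
    for frames in (400, 800):
        print(frames, "frames ~", max_len_to_seconds(frames), "seconds")
```

Under these assumptions, a lower max_len means fewer training windows reach the end of longer clips, which fits the explanation above for the trailing artifacts.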

To mitigate these issues, please use clean data with as high a max_len as possible (800 seems to work well with most fine-tunes). We do some inference tricks on the Vokan space to mitigate this, but there's only so much we can do...
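The inference tricks used on the space are not described here. As one generic illustration of an inference-side mitigation (an assumption, not what the space actually does), a short linear fade-out over the last few milliseconds can soften a distorted tail. The function name and the 50 ms default are hypothetical.

```python
def fade_out_tail(samples, sample_rate=24_000, fade_ms=50.0):
    """Return a copy of samples with a linear fade-out over the last fade_ms milliseconds.

    samples is a sequence of float audio samples. This only attenuates the
    tail; it does not remove noise elsewhere in the clip.
    """
    n_fade = min(len(samples), int(sample_rate * fade_ms / 1000))
    out = list(samples)
    start = len(out) - n_fade
    for i in range(n_fade):
        # Gain falls linearly and reaches 0.0 on the final sample.
        gain = (n_fade - 1 - i) / n_fade
        out[start + i] *= gain
    return out
```

A fade like this only hides the symptom. The data and max_len advice above addresses the cause.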

I'd also like to recommend looking at the Tsukasa speech repo by respair on HF, it uses an improved StyleTTS2 architecture that should improve generation quality quite a bit! This repo should also allow you to make use of fp16 and bf16 training.

Korakoe changed discussion status to closed Dec 18, 2024

umerahsan

Dec 19, 2024

Hey @Korakoe, thanks for your reply!

Could you let me know when you plan to release Vokan V2?

Korakoe

ShoukanLabs org Dec 19, 2024

It's still TBD atm, we're working on quite a lot of exciting things
