Have you ever trained a 44k model?

by LukeJacob2023 - opened 15 days ago

LukeJacob2023

15 days ago

I failed to train from scratch. The model cannot learn alignment from a tiny dataset, so I want to try your solution.

sinhprous

Owner 15 days ago

Hello, how much data do you have, and in which language? Actually, I didn't train it from scratch; I started from the original F5TTS weights. But my solution might help with better alignment.

LukeJacob2023

15 days ago

About 12 hours, Indonesian.

sinhprous

Owner 15 days ago

This method uses phonemes instead of raw text and uses forced alignment during training. Although my training vocabulary differs from that of the original F5TTS model, I still use the pre-trained weights (i.e., I only re-initialize the text embedding layer): the model can already produce sound, so training should be faster. I guess you can use the pre-trained model too instead of training from scratch.
Previously I ran some experiments with LJSpeech, and the model could learn from only 10 hours of data. I am not sure what the results will be if we train the model on a different language. I am currently running experiments with another language (Vietnamese), so maybe we will get more insight.

LukeJacob2023

14 days ago

•

edited 14 days ago

Hello @sinhprous, can you complete the inference code for f5-tts_infer-cli?
The fine-tuning instructions also need more detailed steps, including how to declare the language code.

LukeJacob2023

14 days ago

When I fine-tune with your fork, it prints the following messages but training continues. Is that OK?
Missing keys: ['ema_model.transformer.text_embed.text_embed.weight']
Unexpected keys: []
Missing keys: ['transformer.text_embed.text_embed.weight', 'duration_predictor.text_embed.weight', 'duration_predictor.conv_1.weight', 'duration_predictor.conv_1.bias', 'duration_predictor.norm_1.gamma', 'duration_predictor.norm_1.beta', 'duration_predictor.conv_2.weight', 'duration_predictor.conv_2.bias', 'duration_predictor.norm_2.gamma', 'duration_predictor.norm_2.beta', 'duration_predictor.proj.weight', 'duration_predictor.proj.bias']
Unexpected keys: []

sinhprous

Owner 14 days ago

It's okay: the fork re-initializes the text embedding layer and adds a duration predictor, so those weights are not in the pre-trained checkpoint.
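
A toy sketch of why those messages appear and why they are harmless in this situation. It uses plain Python dicts as stand-ins for state dicts, so it runs without PyTorch; `partial_load` is a hypothetical helper, not a function from the fork, and the key names are shortened from the log above. In PyTorch, `load_state_dict(strict=False)` reports the same two lists of missing and unexpected keys.

```python
def partial_load(model_state, checkpoint_state):
    """Copy every checkpoint entry the model also has and report the rest.

    Both arguments are plain dicts mapping parameter names to values,
    standing in for real state dicts. Returns (missing, unexpected).
    """
    # Parameters the model needs but the checkpoint does not provide.
    missing = [k for k in model_state if k not in checkpoint_state]
    # Parameters the checkpoint provides but the model does not have.
    unexpected = [k for k in checkpoint_state if k not in model_state]
    for key, value in checkpoint_state.items():
        if key in model_state:
            model_state[key] = value  # reuse the pre-trained weight
    return missing, unexpected


# Hypothetical, shortened parameter names for illustration only.
model_state = {
    "transformer.text_embed.text_embed.weight": "fresh init",
    "transformer.attn.weight": "fresh init",
    "duration_predictor.conv_1.weight": "fresh init",
}
# The checkpoint lacks the text embedding and the duration predictor,
# mirroring the situation in the log above.
checkpoint_state = {"transformer.attn.weight": "pre-trained"}

missing, unexpected = partial_load(model_state, checkpoint_state)
print("Missing keys:", missing)
print("Unexpected keys:", unexpected)
```

The missing entries keep their fresh initialization and are learned during fine-tuning, while everything else starts from the pre-trained values.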

sinhprous

Owner 14 days ago

Okay, I will complete f5-tts_infer-cli. In the meantime, you can use the notebook for inference.

LukeJacob2023

13 days ago

•

edited 13 days ago

Hello @sinhprous. I have finished training on Indonesian. The results are not good: the WER is much higher than with the official code.
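
For readers comparing runs, here is a minimal sketch of how word error rate (WER) can be computed as word-level edit distance divided by the reference length. The thread does not say how the WER was measured; `word_error_rate` is a hypothetical helper, the sentences are made up, and in practice the hypothesis would typically come from running a speech recognizer on the synthesized audio.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one word dropped and one word substituted.
reference = "saya pergi ke pasar hari ini"
hypothesis = "saya pergi pasar hari itu"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

Two errors against six reference words give a WER of 2/6. Comparing this number for the same test sentences across checkpoints is one way to make "much higher" concrete.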

sinhprous

Owner 13 days ago

Could you share some samples? How many epochs did you train for?

LukeJacob2023

13 days ago

350k. I have also tried 150k, 200k, and 300k.

sinhprous

Owner 13 days ago

•

edited 13 days ago

I faced the same problem with my Vietnamese training; the results are bad.
Maybe the alignment is wrong for languages other than English.
If possible, could you share one of your training samples (audio, text, and the alignment matrix)?

LukeJacob2023

13 days ago

Sorry, it is private data. I have started fine-tuning based on the official code again.

LukeJacob2023

13 days ago

Did you change the language for espeak?

sinhprous

Owner 12 days ago

Yes, I changed the language for espeak. After reviewing, I think the ctc-forced-aligner results are not correct for my dataset. I've post-processed the alignment results a bit and started training again.
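
The thread does not describe what the post-processing was. As one hypothetical way to review aligner output before training, the sketch below flags segments that are empty, overlap the previous one, or run past the end of the clip. The `(token, start, end)` tuple layout, the function name, and the sample values are all assumptions for illustration and are not the output format of ctc-forced-aligner.

```python
def find_alignment_problems(segments, audio_duration):
    """Return human-readable problems found in a list of aligned segments.

    `segments` is a list of (token, start_sec, end_sec) tuples in
    utterance order; `audio_duration` is the clip length in seconds.
    """
    problems = []
    previous_end = 0.0
    for index, (token, start, end) in enumerate(segments):
        if end <= start:
            problems.append(f"{index}:{token}: empty or negative duration")
        if start < previous_end:
            problems.append(f"{index}:{token}: overlaps the previous segment")
        if end > audio_duration:
            problems.append(f"{index}:{token}: ends after the audio")
        previous_end = max(previous_end, end)
    return problems


# Made-up alignment for a 2.0 second clip.
segments = [
    ("a", 0.00, 0.12),
    ("b", 0.12, 0.30),
    ("c", 0.25, 0.41),  # starts before the previous segment ends
    ("d", 0.41, 0.41),  # zero duration
    ("e", 0.41, 2.50),  # runs past the end of the clip
]
problems = find_alignment_problems(segments, audio_duration=2.0)
for problem in problems:
    print(problem)
```

Running a check like this over the whole dataset and listening to a few flagged clips can show whether the aligner is failing systematically for a given language.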

LukeJacob2023

12 days ago

OK, I'll wait to hear how it goes.

sinhprous

Owner 10 days ago

@LukeJacob2023 did your training succeed? After I fixed the alignment preparation, my Vietnamese training is going well.

LukeJacob2023

9 days ago

•

edited 9 days ago

Yes, I trained for 1029k on the official code and got a good result, with a few speed and stopping problems, but not many: about 5 problems in 3 minutes of output audio. Once you update your fork, I will give it a try. If it saves a lot of training time and makes inference more stable at the cost of only a little naturalness, I think it is a good solution, especially for people like me who don't have powerful GPUs or large datasets. You could also enable issues on your fork so that we can discuss it there.
@sinhprous

LukeJacob2023

7 days ago

•

edited 7 days ago

Hello @sinhprous, can you update your fork or the Vietnamese (vi) checkpoint?

sinhprous

Owner 3 days ago

Hey @LukeJacob2023, sorry for the late reply. I am busy with other company projects; I can update it this weekend.
