Language identification gets worse after multilingual fine-tuning
I am seeing that language detection gets worse after multilingual fine-tuning. I have tried fine-tuning with and without the language label, and in both cases language detection gets worse on the trained languages.
Is it necessary to fine-tune Whisper on both the language identification and transcription tasks?
If so, how is this done?
Any idea why this is happening?
Hey @andrespm - we would expect Whisper's performance to either stay the same or degrade after fine-tuning, except on the single language that we fine-tune it on. This is because the model is prone to 'forgetting' the knowledge it acquired during pre-training and instead focussing entirely on the task presented during fine-tuning (i.e. the weights are tuned entirely for the multilingual ASR task).
What you can try doing is fine-tuning with LoRA / AdaLoRA - in my experience, these two paradigms significantly improve the model's ability to retain its pre-training knowledge during fine-tuning
See https://github.com/Vaibhavs10/fast-whisper-finetuning for details
Thank you very much for your reply.
I will try LoRA / AdaLoRA and check the performance again.
However, one of the languages I am working with is Galician, which is under-represented in Whisper and for which language identification in the pre-trained models does not work very well. I also work with Spanish and Portuguese, and the model tends to confuse Galician with these two languages, which are better represented in the base models.
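To make this kind of confusion measurable before and after fine-tuning, one option is to count (true language, predicted language) pairs on a held-out set. The sketch below is only an illustration: the helper names are hypothetical and the `results` list is placeholder data, not real measurements.

```python
from collections import Counter


def lid_confusion(pairs):
    """Count how often each (true_language, predicted_language) pair occurs."""
    return Counter(pairs)


def lid_accuracy(pairs):
    """Fraction of utterances whose predicted language matches the true one."""
    pairs = list(pairs)
    return sum(true == pred for true, pred in pairs) / len(pairs)


# Placeholder (true, predicted) pairs, NOT real results.
results = [
    ("gl", "gl"), ("gl", "es"), ("gl", "pt"),
    ("es", "es"), ("pt", "pt"), ("gl", "es"),
]

confusion = lid_confusion(results)
accuracy = lid_accuracy(results)
print(confusion[("gl", "es")], accuracy)
```

Running the same evaluation on the base model and on the fine-tuned model gives a direct view of which language pairs got worse.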
I am also exploring the possibility of fine-tuning on both Language Identification and Transcription tasks. I have not found any examples of this. If you have more information that would be great.
I am also discussing this in the following GitHub thread:
https://github.com/openai/whisper/discussions/1454#discussioncomment-6345649
Again, thanks!!!!
Hi @andrespm,
Have you mitigated the decrease in LID quality after multilingual fine-tuning?
We are also observing this even when prefixing the labels with the correct language tag.
Since language identification is done as token prediction, I think we are already training on both ASR and LID when we include the language token in the label tokens.
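A schematic way to see this, using token strings instead of real token IDs: the sketch assumes the labels start at the language token and that the decoder inputs are the labels shifted right behind `<|startoftranscript|>`, which is my understanding of the usual Hugging Face setup. The helper names and the example text tokens are made up for illustration.

```python
SOT = "<|startoftranscript|>"
EOT = "<|endoftext|>"


def build_labels(language, text_tokens):
    """Label sequence: language tag, task tag, no-timestamps tag, text, end token."""
    return [f"<|{language}|>", "<|transcribe|>", "<|notimestamps|>", *text_tokens, EOT]


def shift_right(labels):
    """Decoder inputs under teacher forcing: labels shifted right behind SOT."""
    return [SOT, *labels[:-1]]


labels = build_labels("gl", ["Bos", " días"])
decoder_inputs = shift_right(labels)

# Each pair is (last token the decoder has seen, token it is trained to predict).
training_pairs = list(zip(decoder_inputs, labels))
for step, (seen, target) in enumerate(training_pairs):
    print(step, repr(seen), "->", repr(target))
```

The first pair is the LID step: given only the start-of-transcript token and the audio, the model is trained to predict the language token. So under these assumptions the cross-entropy loss already includes one LID term per example, alongside many more transcription terms.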