Finetuning guide? Supported audio formats for finetuning?

#4 · opened by neurlang

Please tell me about finetuning this system. What is the VRAM requirement? In what format (preferably CSV or TSV) do we provide the audio paths and transcripts?
How do we set the language for the transcripts?

Hi @neurlang

The Canary training script takes a dataset manifest as input in JSONL format. Our tutorial has details on how to create the manifest file and how to finetune the canary-flash models.
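
For illustration, here is a minimal sketch of converting a TSV of audio-path/transcript pairs into a JSONL manifest. The field names follow the Canary tutorial's manifest format, but treat the exact set of keys as an assumption and defer to the tutorial; note that the transcript language is set per entry via the source_lang and target_lang fields, which also answers the language question above.

```python
# Minimal sketch (assumed field names, check the Canary tutorial): convert a TSV
# of "audio_path<TAB>transcript" rows into a NeMo-style JSONL manifest.
import json

import soundfile as sf


def tsv_to_manifest(tsv_path: str, manifest_path: str, lang: str = "en") -> None:
    with open(tsv_path) as tsv, open(manifest_path, "w") as out:
        for row in tsv:
            audio_path, transcript = row.rstrip("\n").split("\t", maxsplit=1)
            duration = sf.info(audio_path).duration  # training script needs durations
            entry = {
                "audio_filepath": audio_path,
                "duration": duration,
                "text": transcript,
                "source_lang": lang,  # language spoken in the audio
                "target_lang": lang,  # same as source_lang for ASR; differs for translation
                "taskname": "asr",
                "pnc": "yes",         # keep punctuation and capitalization
            }
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")


tsv_to_manifest("train.tsv", "train_manifest.jsonl", lang="en")
```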

The Canary-180M-Flash was trained on 32 A100 80GB GPUs. Based on the size of your GPU, you can scale the batch size. The effective batch size can be controlled using trainer.accumulate_grad_batches and the number of GPUs. Be sure to tune the learning rate accordingly.
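
As a quick sanity check on that scaling, the arithmetic is just a product. The linear learning-rate rule below is a common heuristic, not an official Canary recipe, and the reference values are placeholders:

```python
# Effective batch size = per-GPU batch size x number of GPUs x gradient accumulation.
micro_batch_size = 16          # samples per GPU per step
num_gpus = 2                   # trainer.devices
accumulate_grad_batches = 4    # trainer.accumulate_grad_batches

effective_batch_size = micro_batch_size * num_gpus * accumulate_grad_batches
print(effective_batch_size)    # 128

# A common heuristic: scale the learning rate linearly with the effective batch size.
lr_ref, batch_ref = 3e-4, 1024  # placeholder reference recipe, not Canary's actual values
lr = lr_ref * effective_batch_size / batch_ref
print(f"{lr:.2e}")              # 3.75e-05
```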

Hope that helps, please feel free to reach out if you have more questions!

Is it possible to finetune just the vocabulary and language, comparable to training KenLM/n-gram language models for the older CTC models? It was quite neat to train on text only instead of audio plus text.

There is the class BeamSearchSequenceGeneratorWithLanguageModel, for example.
Could this be utilized to quickly fine-tune the transcriptions to an expert domain?

NVIDIA org

@halbefn you can try decoding with an n-gram LM.
It's available in the main branch.
For details on building and using the LM, please see the description of PR https://github.com/NVIDIA/NeMo/pull/12730
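
As a rough sketch of what LM-fused decoding could look like from Python (the config keys ngram_lm_model/ngram_lm_alpha and the beam settings below are assumptions based on the PR description, so check the merged code for the actual interface):

```python
# Hypothetical sketch: enable beam search with n-gram LM fusion on a Canary model.
# The decoding-config keys here are assumptions; see the PR for the real interface.
from copy import deepcopy

from omegaconf import open_dict

from nemo.collections.asr.models import ASRModel

model = ASRModel.from_pretrained("nvidia/canary-180m-flash")

decoding_cfg = deepcopy(model.cfg.decoding)
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "beam"
    decoding_cfg.beam.beam_size = 4
    decoding_cfg.beam.ngram_lm_model = "domain_4gram.nemo"  # placeholder path to your LM
    decoding_cfg.beam.ngram_lm_alpha = 0.3                  # LM weight; tune on held-out data

model.change_decoding_strategy(decoding_cfg)
print(model.transcribe(["sample.wav"])[0])
```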

@artbataev Thank you, it works quite well.
For anyone reading this: changing e.g. multitask_decoding.strategy="beam" to decoding.strategy="beam" lets you use the KenLM models on longer audio files with https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py

Edit: however, if you add "timestamps=True" to speech_to_text_aed_chunked_infer.py, you get nonsense transcripts.
