---
license: mit
language: fr
library_name: transformers
pipeline_tag: automatic-speech-recognition
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- gigant/african_accented_french
- espnet/yodas
metrics:
- wer
---

# Whisper-Large-V3-Distil-French-v0.2

A distilled version of Whisper with 2 decoder layers, optimized for French speech-to-text.

Compared to [v0.1](https://huggingface.co/collections/bofenghuang/french-whisper-v01-64f9cc3cf625e46d12f0e4bd), this version extends the training to 30-second audio segments in order to maintain long-form transcription ability. The training process used a ["patient" teacher](https://arxiv.org/abs/2106.05237) during distillation, meaning longer training times and more aggressive data augmentation, which improved overall performance.

The model uses [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model while keeping the encoder architecture unchanged. This makes it suitable as a draft model for speculative decoding: by adding only 2 extra decoder layers and running the encoder just once, it can potentially achieve 2x faster inference while guaranteeing identical outputs (a usage sketch is included at the end of this card). It can also serve as a standalone model, trading some accuracy for better efficiency: it runs 5.8x faster while using only 49% of the parameters. This [paper](https://arxiv.org/abs/2311.00430) also suggests that the distilled model may actually produce fewer hallucinations than the full model during long-form transcription.

The model has been converted into multiple formats to ensure broad compatibility across libraries, including transformers, openai-whisper, faster-whisper, whisper.cpp, candle, and mlx.

## Performance

The model was evaluated on both short-form and long-form transcription, using in-distribution (ID) and out-of-distribution (OOD) datasets to assess accuracy, generalizability, and robustness.

Note that the Word Error Rate (WER) results shown here are [post-normalization](https://github.com/openai/whisper/blob/main/whisper/normalizers/basic.py), which includes converting text to lowercase and removing symbols and punctuation (a sketch of this normalization step is included at the end of this card).

All evaluation results on the public datasets can be found [here]().

### Short-Form Transcription

![eval-short-form](https://huggingface.co/bofenghuang/whisper-large-v3-distil-fr-v0.2/resolve/main/assets/eval_short_form.png)

*Italic* indicates in-distribution (ID) evaluation, where test sets correspond to data distributions seen during training, typically yielding higher performance than out-of-distribution (OOD) evaluation. *~~Italic and strikethrough~~* denotes potential test set contamination, for example when training and evaluation use different versions of Common Voice, raising the possibility of overlapping data.

Due to the limited availability of OOD and long-form French test sets, evaluation was also performed using internal test sets from [Zaion Lab](https://zaion.ai/), consisting of human-annotated call center conversations with significant background noise and domain-specific terminology.

### Long-Form Transcription

Long-form transcription evaluation used the 🤗 Hugging Face [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) with both [chunked](https://huggingface.co/blog/asr-chunking) decoding (`chunk_length_s=30`) and the original sequential decoding method, as sketched below.
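A minimal sketch of the chunked long-form setup described above, using the standard transformers pipeline API; the audio file name is a placeholder:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Chunked decoding: the pipeline splits long audio into 30-second windows
# with overlapping strides and merges the partial transcriptions
pipe = pipeline(
    "automatic-speech-recognition",
    model="bofenghuang/whisper-large-v3-distil-fr-v0.2",
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=30,
)

# "long_audio.wav" is a placeholder path
result = pipe("long_audio.wav", return_timestamps=True)
print(result["text"])
```

Evaluation results are shown in the figure below.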
![eval-long-form](https://huggingface.co/bofenghuang/whisper-large-v3-distil-fr-v0.2/resolve/main/assets/eval_long_form.png)
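For reference, the text normalization applied before computing WER can be reproduced roughly as follows. This is an illustrative sketch using the `BasicTextNormalizer` shipped with transformers together with the `evaluate` library; the reference/prediction pair is made up:

```python
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = load("wer")
normalizer = BasicTextNormalizer()  # lowercases text and strips symbols/punctuation

# Made-up reference/prediction pair for illustration
references = ["Bonjour, comment ça va ?"]
predictions = ["bonjour comment ça va"]

wer = wer_metric.compute(
    references=[normalizer(text) for text in references],
    predictions=[normalizer(text) for text in predictions],
)
print(f"WER: {wer * 100:.2f}%")
```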
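And here is a sketch of the speculative decoding setup mentioned earlier, where this model serves as the draft (assistant) model for [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3); the audio path is a placeholder:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: the full Whisper large-v3 teacher
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

# Draft model: the 2-decoder-layer distilled model proposes tokens
# that the main model then verifies
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bofenghuang/whisper-large-v3-distil-fr-v0.2",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
)

# "audio.wav" is a placeholder path
result = pipe("audio.wav")
print(result["text"])
```

Because every draft token is verified by the full model, the final transcription matches what whisper-large-v3 would produce on its own; the distilled decoder only accelerates generation.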