|
--- |
|
language: multilingual |
|
datasets: |
|
- common_voice |
|
- multilingual_librispeech |
|
- covost2 |
|
tags: |
|
- speech |
|
- xls_r |
|
- automatic-speech-recognition |
|
- xls_r_translation |
|
pipeline_tag: automatic-speech-recognition |
|
license: apache-2.0 |
|
widget: |
|
- example_title: Swedish |
|
src: https://cdn-media.huggingface.co/speech_samples/cv_swedish_1.mp3 |
|
- example_title: Arabic |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_ar_19058308.mp3 |
|
- example_title: Russian |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3 |
|
- example_title: German |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_de_17284683.mp3 |
|
- example_title: French |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_fr_17299386.mp3 |
|
- example_title: Indonesian |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_id_19051309.mp3 |
|
- example_title: Italian |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_it_17415776.mp3 |
|
- example_title: Japanese |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_ja_19482488.mp3 |
|
- example_title: Mongolian |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_mn_18565396.mp3 |
|
- example_title: Dutch |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3 |
|
- example_title: Russian |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3 |
|
- example_title: Turkish |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_tr_17341280.mp3 |
|
- example_title: Catalan |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_ca_17367522.mp3 |
|
- example_title: English |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3 |
|
- example_title: Dutch |
|
src: https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3 |
|
--- |
|
|
|
# Wav2Vec2-XLS-R-2b-21-EN |
|
|
|
Facebook's Wav2Vec2 XLS-R fine-tuned for **Speech Translation.** |
|
|
|
![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xls_r.png) |
|
|
|
This is a [SpeechEncoderDecoderModel](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html) model. |
|
The encoder was warm-started from the [**`facebook/wav2vec2-xls-r-1b`**](https://huggingface.co/facebook/wav2vec2-xls-r-1b) checkpoint and |
|
the decoder from the [**`facebook/mbart-large-50`**](https://huggingface.co/facebook/mbart-large-50) checkpoint. |
|
Consequently, the encoder-decoder model was fine-tuned on 21 `{lang}` -> `en` translation pairs of the [Covost2 dataset](https://huggingface.co/datasets/covost2). |
|
|
|
The model can translate from the following spoken languages `{lang}` -> `en` (English): |
|
|
|
{`fr`, `de`, `es`, `ca`, `it`, `ru`, `zh-CN`, `pt`, `fa`, `et`, `mn`, `nl`, `tr`, `ar`, `sv-SE`, `lv`, `sl`, `ta`, `ja`, `id`, `cy`} -> `en` |
|
|
|
For more information, please refer to Section *5.1.2* of the [official XLS-R paper](https://arxiv.org/abs/2111.09296). |
|
|
|
## Usage |
|
|
|
### Demo |
|
|
|
The model can be tested directly on the speech recognition widget on this model card! |
|
Simple record some audio in one of the possible spoken languages or pick an example audio file to see how well the checkpoint can translate the input. |
|
|
|
### Example |
|
|
|
As this a standard sequence to sequence transformer model, you can use the `generate` method to generate the |
|
transcripts by passing the speech features to the model. |
|
|
|
You can use the model directly via the ASR pipeline |
|
|
|
```python |
|
from datasets import load_dataset |
|
from transformers import pipeline |
|
|
|
# replace following lines to load an audio file of your choice |
|
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation") |
|
audio_file = librispeech_en[0]["file"] |
|
|
|
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-1b-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-1b-21-to-en") |
|
|
|
translation = asr(audio_file) |
|
``` |
|
|
|
or step-by-step as follows: |
|
|
|
```python |
|
import torch |
|
from transformers import Speech2Text2Processor, SpeechEncoderDecoder |
|
from datasets import load_dataset |
|
|
|
model = SpeechEncoderDecoder.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en") |
|
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en") |
|
|
|
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation") |
|
|
|
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["array"]["sampling_rate"], return_tensors="pt") |
|
generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"]) |
|
transcription = processor.batch_decode(generated_ids) |
|
``` |
|
|
|
## Results `{lang}` -> `en` |
|
|
|
See the row of **XLS-R (1B)** for the performance on [Covost2](https://huggingface.co/datasets/covost2) for this model. |
|
|
|
![results image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/X-%3EEnglish.png) |
|
|
|
## More XLS-R models for `{lang}` -> `en` Speech Translation |
|
|
|
- [Wav2Vec2-XLS-R-300M-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-300m-21-to-en) |
|
- [Wav2Vec2-XLS-R-1B-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-1b-21-to-en) |
|
- [Wav2Vec2-XLS-R-2B-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-2b-21-to-en) |
|
- [Wav2Vec2-XLS-R-2B-22-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16) |
|
|