File size: 6,950 Bytes
64967ea 765fb8c 64967ea b2d3244 252e580 3d87c53 b2d3244 64967ea 252e580 64967ea 62592d7 64967ea 252e580 765fb8c 252e580 8a3b670 64967ea 8a3b670 64967ea c3a4b82 64967ea 5d103dc 64967ea 02dacc4 64967ea a005325 62592d7 a005325 62592d7 a005325 62592d7 a005325 62592d7 a005325 64967ea ec0a4d2 64967ea ec0a4d2 64967ea ec0a4d2 64967ea ec0a4d2 64967ea 765fb8c 64967ea 765fb8c 64967ea 765fb8c 64967ea 765fb8c 64967ea a35ec72 04ecb59 a35ec72 b54ced2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 |
---
language: en
datasets:
- common_voice
- mozilla-foundation/common_voice_6_0
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- en
- hf-asr-leaderboard
- mozilla-foundation/common_voice_6_0
- robust-speech-event
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 English by Jonatas Grosman
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice en
type: common_voice
args: en
metrics:
- name: Test WER
type: wer
value: 19.06
- name: Test CER
type: cer
value: 7.69
- name: Test WER (+LM)
type: wer
value: 14.81
- name: Test CER (+LM)
type: cer
value: 6.84
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Robust Speech Event - Dev Data
type: speech-recognition-community-v2/dev_data
args: en
metrics:
- name: Dev WER
type: wer
value: 27.72
- name: Dev CER
type: cer
value: 11.65
- name: Dev WER (+LM)
type: wer
value: 20.85
- name: Dev CER (+LM)
type: cer
value: 11.01
base_model:
- jonatasgrosman/wav2vec2-large-xlsr-53-english
---
# Disclaimer and Requirements
This model is a clone of [**jonatasgrosman/wav2vec2-large-xlsr-53-english**](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) compressed using ZipNN. Compressed losslessly to 88% its original size, ZipNN saved ~0.2GB in storage and potentially ~4PB in data transfer **monthly**.
### Requirement
In order to use the model, ZipNN is necessary:
```bash
pip install zipnn
```
### Use This Model
```python
# Use a pipeline as a high-level helper
from transformers import pipeline
from zipnn import zipnn_hf
zipnn_hf()
pipe = pipeline("automatic-speech-recognition", model="royleibov/wav2vec2-large-xlsr-53-english")
```
```python
# Load model directly
from transformers import AutoProcessor, AutoModelForCTC
from zipnn import zipnn_hf
zipnn_hf()
processor = AutoProcessor.from_pretrained("royleibov/wav2vec2-large-xlsr-53-english")
model = AutoModelForCTC.from_pretrained("royleibov/wav2vec2-large-xlsr-53-english")
```
### ZipNN
ZipNN also allows you to seemlessly save local disk space in your cache after the model is downloaded.
To compress the cached model, simply run:
```bash
python zipnn_compress_path.py safetensors --model royleibov/granite-3.0-8b-instruct-ZipNN-Compressed --hf_cache
```
The model will be decompressed automatically and safely as long as `zipnn_hf()` is added at the top of the file like in the [example above](#use-this-model).
To decompress manualy, simply run:
```bash
python zipnn_decompress_path.py --model royleibov/granite-3.0-8b-instruct-ZipNN-Compressed --hf_cache
```
# Fine-tuned XLSR-53 large model for speech recognition in English
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on English using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice).
When using this model, make sure that your speech input is sampled at 16kHz.
This model has been fine-tuned thanks to the GPU credits generously given by the [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)
The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
## Usage
The model can be used directly (without a language model) as follows...
Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:
```python
from huggingsound import SpeechRecognitionModel
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
```
Writing your own inference script:
```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
SAMPLES = 10
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
batch["speech"] = speech_array
batch["sentence"] = batch["sentence"].upper()
return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
for i, predicted_sentence in enumerate(predicted_sentences):
print("-" * 100)
print("Reference:", test_dataset[i]["sentence"])
print("Prediction:", predicted_sentence)
```
| Reference | Prediction |
| ------------- | ------------- |
| "SHE'LL BE ALL RIGHT." | SHE'LL BE ALL RIGHT |
| SIX | SIX |
| "ALL'S WELL THAT ENDS WELL." | ALL AS WELL THAT ENDS WELL |
| DO YOU MEAN IT? | DO YOU MEAN IT |
| THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESSION |
| HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSLILLAR GOING TO HANDLE ANDBEWOOTH HIS LIKE Q AND Q |
| "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTIAN WASTIN PAN ONTE BATTLY |
| NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
| SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER |
| GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |
## Evaluation
1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`
```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test
```
2. To evaluate on `speech-recognition-community-v2/dev_data`
```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
```
## Citation
If you want to cite this model you can use this:
```bibtex
@misc{grosman2021xlsr53-large-english,
title={Fine-tuned {XLSR}-53 large model for speech recognition in {E}nglish},
author={Grosman, Jonatas},
howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english}},
year={2021}
}
``` |