Update README.md
README.md (CHANGED)
Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).

Whisper `large-v3` has the same architecture as the previous large models, except for the following minor differences:

1. The input uses 128 Mel frequency bins instead of 80
2. A new language token for Cantonese
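
Both differences can be inspected from the released preprocessing files. The snippet below is a minimal sanity-check sketch, assuming the checkpoint is published under the `openai/whisper-large-v3` model id and that the Cantonese language token follows the usual `<|yue|>` naming:

```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

model_id = "openai/whisper-large-v3"  # assumed checkpoint id

# 1. The log-Mel spectrogram input now uses 128 frequency bins (80 previously).
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
print(feature_extractor.feature_size)  # expected: 128

# 2. The tokenizer gains a language token for Cantonese (assumed to be "<|yue|>").
tokenizer = WhisperTokenizer.from_pretrained(model_id)
print("<|yue|>" in tokenizer.get_vocab())  # expected: True
```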

The Whisper `large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
The model was trained for 2.0 epochs over this mixture dataset.

The `large-v3` model shows improved performance over a wide variety of languages: it performs at lower than a 60% error rate on Common Voice 15 and Fleurs, and shows a 10% to 20% reduction in errors compared to Whisper `large-v2`.

**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were copied and pasted from the original model card.

## Model details

Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model.
It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.

The models were trained on either English-only data or multilingual data. The English-only models were trained
on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation.

## Usage

Whisper `large-v3` is supported in Hugging Face 🤗 Transformers through the `main` branch in the Transformers repo. To run the model, first
install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy
audio dataset from the Hugging Face Hub:

```python
# Assumes Transformers installed from the GitHub repo (e.g. pip install git+https://github.com/huggingface/transformers) plus 🤗 Datasets.
import torch
from transformers import AutoModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
```
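
Continuing from the snippet above, the model is wired into the `automatic-speech-recognition` pipeline together with its processor. This is a minimal sketch rather than the card's full example; the toy dataset id and the extra pipeline arguments are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import AutoProcessor, pipeline

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Toy audio sample from the Hub (dataset id assumed for illustration).
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

# To transcribe a local audio file instead, pass its path, e.g. pipe("audio.mp3").
```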

### Long-Form Transcription

In 🤗 Transformers, Whisper uses a chunked algorithm to transcribe long-form audio files (longer than 30 seconds). In practice, this chunked long-form algorithm
is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
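```

Building on the same model setup, chunking and batching are options on the pipeline itself. A minimal sketch is shown below; the values `chunk_length_s=30` and `batch_size=16` are illustrative choices, not prescribed settings:

```python
from transformers import AutoProcessor, pipeline

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,  # split long audio into 30-second chunks
    batch_size=16,      # transcribe chunks in parallel batches
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("long_audio.mp3")  # placeholder path to a long-form (> 30 s) audio file
print(result["text"])
```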

### Speculative Decoding

Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
replacement for existing Whisper pipelines, since the same outputs are guaranteed.

```python
import torch
from transformers import AutoModelForCausalLM

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the assistant (draft) model used for speculative decoding.
assistant_model_id = "openai/whisper-tiny"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)
```
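
The assistant model is then handed to the main `large-v3` pipeline at generation time. The sketch below assumes the main model and processor are loaded as in the earlier snippets, and uses `generate_kwargs` to pass the assistant, which is how the 🤗 Transformers pipeline exposes assisted generation:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},  # whisper-tiny drafts tokens, large-v3 verifies them
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")  # placeholder path
print(result["text"])
```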

## Training Data

The models are trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.

As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.