---
language:
- en
library_name: whisper
tags:
- translation
- speech
- audio
- automatic-speech-recognition
datasets:
- whisper
metrics:
- WER
license: mit
---
This model was forked from the original [OpenAI whisper model](https://github.com/openai/whisper).

# Whisper

## Model
Whisper is a multilingual speech-to-text model.
It takes in raw audio recordings in many languages and outputs transcriptions either in the original language or translated into English.
The model first converts speech to spectrograms, then uses an auto-regressive transformer to decode the speech to text.
Here is an overview of the architecture:

![Whisper architecture](whisper_architecture.svg)

For more information on the technical implementation, consult the [paper](https://cdn.openai.com/papers/whisper.pdf).
## Training Data

The model was trained on 680,000 hours of audio and associated transcripts collected from the internet.
The majority of the audio is in English (~65%), while the remainder is in other languages.
A total of 98 different languages are represented in the dataset.

![Language breakdown](language_breakdown.svg)

## Model Variations

OpenAI has released 9 different versions of the model, trained either on English-only audio or on multilingual data.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~32x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~16x           |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~6x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |

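One practical way to read the table above is as a VRAM budget: pick the largest variant that fits your GPU. The helper below is a hypothetical sketch (not part of the Whisper API) that encodes the approximate VRAM figures from the table:

```python
# Approximate VRAM requirements in GB, taken from the table above.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
# Sizes ordered from smallest to largest so we can pick the biggest that fits.
SIZE_ORDER = ["tiny", "base", "small", "medium", "large"]


def pick_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the largest model name whose approximate VRAM need fits the budget."""
    chosen = None
    for size in SIZE_ORDER:
        if MODEL_VRAM_GB[size] <= vram_gb:
            chosen = size
    if chosen is None:
        raise ValueError(f"No Whisper model fits in {vram_gb:.1f} GB of VRAM")
    # Per the table, there is no English-only variant of the large model.
    if english_only and chosen != "large":
        chosen += ".en"
    return chosen


print(pick_model(6))                      # → medium
print(pick_model(2, english_only=True))   # → small.en
```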
## Limitations and bias

In the [paper](https://cdn.openai.com/papers/whisper.pdf), the authors find a direct correlation between performance on a given language and the amount of data for that language in the training set.
As such, languages that are under-represented in the scraped dataset perform less well with Whisper.
Because English is much more prevalent than other languages, the model will likely perform best on English.
This is shown in the following figure, where a lower word error rate (WER) indicates better performance:

![WER breakdown by language](WER_breakdown.svg)
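For reference, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. Below is a minimal self-contained sketch; real evaluations typically normalize the text first and use a library such as `jiwer`:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```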