jerpint committed · Commit b7d1a41 · 1 Parent(s): 140d6bf

update model card

Files changed (1):
1. README.md +54 -0
README.md CHANGED
---
language:
- en
library_name: whisper
tags:
- translation
- speech
- audio
- automatic-speech-recognition
datasets:
- whisper
metrics:
- WER
license: mit
---

This model was forked from the original [OpenAI whisper model](https://github.com/openai/whisper).

# Whisper

## Model

Whisper is a multilingual speech-to-text model.
It takes in raw audio recordings in many languages and outputs transcriptions, either in the language of origin or translated to English.
The model first converts speech to spectrograms, then uses an auto-regressive transformer to decode the speech to text.
Here is an overview of the architecture:

![model_architecture](https://github.com/jerpint/whisper/raw/main/approach.png)

For more information on the technical implementation, consult the [paper](https://cdn.openai.com/papers/whisper.pdf).

## Training Data

The model was trained on 680,000 hours of audio and associated transcripts scraped from the internet.
The majority of the audio is in English (~65%), while the remainder is in other languages.
A total of 98 different languages are represented in the dataset.

![image](https://user-images.githubusercontent.com/18450628/204110014-e2684385-d790-4dd7-8ce1-47168efb2726.png)

## Model Variations

OpenAI has released 9 different versions of the model, trained either on English-only audio or on multilingual data.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~32x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~16x           |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~6x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |

## Limitations and bias

In the [paper](https://cdn.openai.com/papers/whisper.pdf), the authors find a direct correlation between performance on a given language and the amount of data for that language in the dataset.
As such, languages that are under-represented in the scraped dataset perform worse with Whisper.
Because English is much more prevalent than other languages, the model will likely perform best on English.
This is shown in the following figure, where a lower word error rate (WER) indicates better performance:

![model_performance](https://github.com/jerpint/whisper/raw/main/language-breakdown.svg)
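For reference, WER is the word-level edit distance between a hypothesis transcript and the reference transcript, normalized by the number of words in the reference. A minimal, self-contained sketch (the example strings are hypothetical):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```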