Commit 457b8ea (verified), committed by saattrupdan
1 Parent(s): 0b01f74

Finished finetuning 🎉

Files changed (4)
  1. README.md +31 -186
  2. language_model/3gram.bin +2 -2
  3. language_model/unigrams.txt +2 -2
  4. vocab.json +44 -1
README.md CHANGED
@@ -3,205 +3,50 @@ library_name: transformers
3
  language:
4
  - da
5
  license: openrail
6
- base_model: chcaa/xls-r-300m-danish
7
- datasets:
8
- - alexandrainst/coral
9
- metrics:
10
- - wer
11
- - cer
12
  model-index:
13
- - name: roest-315m
14
- results:
15
- - task:
16
- name: Automatic Speech Recognition
17
- type: automatic-speech-recognition
18
- dataset:
19
- name: CoRal read-aloud
20
- type: alexandrainst/coral
21
- split: test
22
- args: read_aloud
23
- metrics:
24
- - name: CER
25
- type: cer
26
- value: 6.6% ± 0.2%
27
- - name: WER
28
- type: wer
29
- value: 17.0% ± 0.4%
30
- - task:
31
- name: Automatic Speech Recognition
32
- type: automatic-speech-recognition
33
- dataset:
34
- name: Danish Common Voice 17
35
- type: mozilla-foundation/common_voice_17_0
36
- split: test
37
- args: da
38
- metrics:
39
- - name: CER
40
- type: cer
41
- value: 6.6% ± 0.6%
42
- - name: WER
43
- type: wer
44
- value: 16.7% ± 0.8%
45
- pipeline_tag: automatic-speech-recognition
46
  ---
47
 
48
- # Røst-315m
 
49
 
50
- This is a Danish state-of-the-art speech recognition model, trained by [the Alexandra
51
- Institute](https://alexandra.dk/).
52
 
53
- Try it out in [our interactive demo](https://huggingface.co/spaces/alexandrainst/roest-demo)!
54
 
 
55
 
56
- ## Quick Start
57
- Start by installing the required libraries:
58
 
59
- ```shell
60
- $ pip install transformers kenlm pyctcdecode
61
- ```
62
 
63
- Next you can use the model using the `transformers` Python package as follows:
64
 
65
- ```python
66
- >>> from transformers import pipeline
67
- >>> audio = get_audio() # 16kHz raw audio array
68
- >>> transcriber = pipeline(model="alexandrainst/roest-315m")
69
- >>> transcriber(audio)
70
- {'text': 'your transcription'}
71
- ```
72
 
 
73
 
74
- ## Evaluation Results
75
 
76
- We have evaluated both our and existing models on the CoRal test set as well as the
77
- Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have
78
- bootstrapped the results 1000 times and report here the mean scores along with a 95%
79
- confidence interval (lower is better; best scores in **bold**, second-best in
80
- *italics*):
81
 
82
- | Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
83
- |:---|---:|---:|---:|---:|---:|
84
- | Røst-315m (this model) | 315M | **6.6%** | **17.0%** | 6.6% ± 0.6% | 16.7% ± 0.8% |
85
- | [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | 14.4% ± 0.3% | 36.5% ± 0.6% | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
86
- | [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 14.2% ± 0.5% | 33.2% ± 0.7% | *5.2% ± 0.4%* | *14.2% ± 0.8%* |
87
- | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | *11.4% ± 0.3%* | *28.3% ± 0.6%* | *5.5% ± 0.4%* | *14.8% ± 0.8%* |
88
- | [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 13.9% ± 0.9% | 32.6% ± 1.2% | 7.2% ± 0.5% | 18.5% ± 0.9% |
89
- | [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 14.5% ± 0.3% | 35.4% ± 0.6% | 9.2% ± 0.5% | 22.9% ± 1.0% |
90
- | [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 17.2% ± 1.3% | 40.5% ± 2.1% | 9.4% ± 0.5% | 24.0% ± 1.0% |
91
- | [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 23.4% ± 1.2% | 55.2% ± 2.3% | 15.9% ± 1.0% | 38.9% ± 1.2% |
92
- | [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 43.5% ± 3.1% | 89.3% ± 4.6% | 33.4% ± 4.7% | 71.4% ± 7.0% |
93
- | [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 52.0% ± 2.5% | 103.7% ± 3.5% | 42.2% ± 3.9% | 83.6% ± 2.7% |
94
 
 
95
 
96
- ### Detailed Evaluation Across Demographics on the CoRal Test Set
97
-
98
- ![CER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-cer-plot.png)
99
- ![WER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-wer-plot.png)
100
-
101
-
102
- ## Training Data
103
-
104
- This model is the result of four different stages of training:
105
-
106
- 1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data,
107
- 13,628 hours of which is Danish. Pretraining here means that the model learnt to
108
- "fill in" gaps of raw audio - no transcriptions were used (or available) during
109
- this process. The pretraining data is distributed as follows:
110
- - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being
111
- speeches from the European Parliament in 23 European languages.
112
- This includes 13,600 hours of Danish speech.
113
- - 51,000 hours from [Multilingual
114
- LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in
115
- 8 European languages. This does not include any Danish speech.
116
- - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670),
117
- being read-aloud speech in 60 diverse languages. This does not include any Danish
118
- speech.
119
- - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459),
120
- being audio from YouTube videos in 107 languages. This includes 28 hours of
121
- Danish speech.
122
- - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being
123
- conversational telephone speech in 17 African and Asian languages. This does not
124
- include any Danish speech.
125
- 2. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning"
126
- indicates that this stage of training was supervised, i.e. the model was trained on
127
- both audio and transcriptions to perform the speech-to-text task (also known as
128
- automatic speech recognition). The finetuning data is as follows:
129
- - The read-aloud training split of the [CoRal
130
- dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision
131
- fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish
132
- read-aloud speech, diverse across dialects, accents, ages and genders.
133
- - The Danish training split of the [Common Voice 17
134
- dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0),
135
- consisting of 12 hours of Danish read-aloud speech.
136
- 3. An n-gram language model has been trained separately, and is used to guide the
137
- transcription generation of the finetuned speech recognition model. This n-gram
138
- language model has been trained on the following datasets:
139
- - [Danish
140
- Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da)
141
- (approximately 287,000 articles).
142
- - [Danish Common Voice 17 training
143
- split](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da)
144
- (approximately 3,500 comments).
145
- - [Danish
146
- Reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit/viewer/da)
147
- (approximately 5 million comments).
148
- Note that all samples from the CoRal test dataset have been removed from all of
149
- these datasets, to ensure that the n-gram model has not seen the test data.
150
-
151
- The first step was trained by [Babu et al.
152
- (2021)](https://doi.org/10.48550/arXiv.2111.09296) and the second and third step by
153
- [Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).
154
-
155
- The final product is then the combination of the finetuned model along with the n-gram
156
- model, and this is what is used when you use the model as mentioned in the Quick Start
157
- section above.
158
-
159
-
160
- ## Intended use cases
161
-
162
- This model is intended to be used for Danish automatic speech recognition.
163
-
164
- Note that Biometric Identification is not allowed using the CoRal dataset and/or derived
165
- models. For more information, see addition 4 in our
166
- [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).
167
-
168
-
169
- ## Why the name Røst?
170
-
171
- Røst is both the [Danish word for the human
172
- voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) as well as being the name of [one
173
- of the cold-water coral reefs in
174
- Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).
175
-
176
-
177
- ## License
178
- The dataset is licensed under a custom license, adapted from OpenRAIL-M, which allows
179
- commercial use with a few restrictions (speech synthesis and biometric identification).
180
- See
181
- [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).
182
-
183
-
184
- ## Creators and Funders
185
- The CoRal project is funded by the [Danish Innovation
186
- Fund](https://innovationsfonden.dk/) and consists of the following partners:
187
-
188
- - [Alexandra Institute](https://alexandra.dk/)
189
- - [University of Copenhagen](https://www.ku.dk/)
190
- - [Agency for Digital Government](https://digst.dk/)
191
- - [Alvenir](https://www.alvenir.ai/)
192
- - [Corti](https://www.corti.ai/)
193
-
194
-
195
- ## Citation
196
-
197
- We will submit a research paper soon, but until then, if you use this model in your
198
- research or development, please cite it as follows:
199
-
200
- ```bibtex
201
- @dataset{coral2024,
202
- author = {Dan Saattrup Nielsen, Sif Bernstorff Lehmann, Simon Leminen Madsen, Anders Jess Pedersen, Anna Katrine van Zee, Anders Søgaard and Torben Blach},
203
- title = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
204
- year = {2024},
205
- url = {https://hf.co/datasets/alexandrainst/coral},
206
- }
207
- ```
 
3
  language:
4
  - da
5
  license: openrail
6
+ base_model: facebook/wav2vec2-xls-r-300m
7
+ tags:
8
+ - generated_from_trainer
 
 
 
9
  model-index:
10
+ - name: roest-315m-xlsr
11
+ results: []
12
  ---
13
 
14
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
+ should probably proofread and complete it, then remove this comment. -->
16
 
17
+ # roest-315m-xlsr
 
18
 
19
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on an unknown dataset.
20
 
21
+ ## Model description
22
 
23
+ More information needed
 
24
 
25
+ ## Intended uses & limitations
 
 
26
 
27
+ More information needed
28
 
29
+ ## Training and evaluation data
 
 
 
 
 
 
30
 
31
+ More information needed
32
 
33
+ ## Training procedure
34
 
35
+ ### Training hyperparameters
 
 
 
 
36
 
37
+ The following hyperparameters were used during training:
38
+ - learning_rate: 0.0001
39
+ - train_batch_size: 256
40
+ - eval_batch_size: 256
41
+ - seed: 4242
42
+ - optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
43
+ - lr_scheduler_type: cosine
44
+ - lr_scheduler_warmup_steps: 1000
45
+ - training_steps: 10000
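
The learning-rate hyperparameters listed above (`learning_rate: 0.0001`, `lr_scheduler_type: cosine`, `lr_scheduler_warmup_steps: 1000`, `training_steps: 10000`) can be illustrated with a small sketch. It assumes that "cosine" means a linear warmup from zero followed by a half-cosine decay to zero, which is how the `transformers` cosine schedule is commonly described. The commit itself does not confirm the exact shape, and the function name `lr_at` is purely illustrative.

```python
import math

# Values copied from the hyperparameter list in the diff.
LEARNING_RATE = 1e-4
WARMUP_STEPS = 1000
TRAINING_STEPS = 10000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to LEARNING_RATE over WARMUP_STEPS,
    then half-cosine decay to 0 at TRAINING_STEPS.
    """
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TRAINING_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, 1000, 5500, 10000):
        print(step, lr_at(step))
```

Under this assumption the rate peaks at step 1000 and is halved midway through the decay phase, at step 5500.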
 
 
 
46
 
47
+ ### Framework versions
48
 
49
+ - Transformers 4.44.2
50
+ - Pytorch 2.4.1+cu121
51
+ - Datasets 3.0.0
52
+ - Tokenizers 0.19.1
language_model/3gram.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:321f15f840bab9e5ccb62d7a94174b03c85d01e9bc118d5fbe87dcd1e9a2270c
3
- size 1016037798
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3ec877a2f9dad4e51bfcbdd0e32884b64a7662f722c7f37c77ea91dc3dea65db
3
+ size 750711338
language_model/unigrams.txt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:51866dc0e5e69fa009540e723ca24499600b3767df241f7af9d8ae635128be22
3
- size 39660627
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:683060ef402a6d88def5dc3ff15518b4d44e50ccb7ac12aad81a258d88fb5a72
3
+ size 29668511
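
The two `language_model/` entries above are Git LFS pointer files, not the binaries themselves: each line is a key followed by a space and a value. A minimal sketch of reading such pointers and comparing the recorded sizes might look like the following. The helper name `parse_lfs_pointer` is hypothetical, and the pointer texts are copied from the `unigrams.txt` diff.

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer contents before and after the commit (from the unigrams.txt diff).
OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:51866dc0e5e69fa009540e723ca24499600b3767df241f7af9d8ae635128be22
size 39660627
"""
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:683060ef402a6d88def5dc3ff15518b4d44e50ccb7ac12aad81a258d88fb5a72
size 29668511
"""

old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)

# The size field is the byte count of the real file stored in LFS.
shrink_bytes = int(old["size"]) - int(new["size"])

if __name__ == "__main__":
    print(f"unigrams.txt shrank by {shrink_bytes} bytes")
```

The changed `oid` shows that the file contents differ, and the `size` fields show that the new unigram list is smaller than the old one.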
vocab.json CHANGED
@@ -1 +1,44 @@
1
- {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8, "9": 9, "a": 10, "b": 11, "c": 12, "d": 13, "e": 14, "f": 15, "g": 16, "h": 17, "i": 18, "j": 19, "k": 20, "l": 21, "m": 22, "n": 23, "o": 24, "p": 25, "q": 26, "r": 27, "s": 28, "t": 29, "u": 30, "v": 31, "w": 32, "x": 33, "y": 34, "z": 35, "|": 36, "\u00e5": 37, "\u00e6": 38, "\u00e9": 39, "\u00f8": 40, "\u00fc": 41}
1
+ {
2
+ "0": 0,
3
+ "1": 1,
4
+ "2": 2,
5
+ "3": 3,
6
+ "4": 4,
7
+ "5": 5,
8
+ "6": 6,
9
+ "7": 7,
10
+ "8": 8,
11
+ "9": 9,
12
+ "a": 10,
13
+ "b": 11,
14
+ "c": 12,
15
+ "d": 13,
16
+ "e": 14,
17
+ "f": 15,
18
+ "g": 16,
19
+ "h": 17,
20
+ "i": 18,
21
+ "j": 19,
22
+ "k": 20,
23
+ "l": 21,
24
+ "m": 22,
25
+ "n": 23,
26
+ "o": 24,
27
+ "p": 25,
28
+ "q": 26,
29
+ "r": 27,
30
+ "s": 28,
31
+ "t": 29,
32
+ "u": 30,
33
+ "v": 31,
34
+ "w": 32,
35
+ "x": 33,
36
+ "y": 34,
37
+ "z": 35,
38
+ "|": 36,
39
+ "å": 37,
40
+ "æ": 38,
41
+ "é": 39,
42
+ "ø": 40,
43
+ "ü": 41
44
+ }
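
The `vocab.json` change above looks like a serialization change only: the old file writes non-ASCII characters as `\uXXXX` escapes on one line, while the new one writes them literally with one entry per line. The sketch below checks this on the last few entries of each version. It also shows one way such output could be produced with `json.dumps`, although how the file was actually written is an assumption and is not stated in the commit.

```python
import json

# Tail of the old vocab.json (escaped, single line), copied from the diff.
old_fragment = r'{"|": 36, "\u00e5": 37, "\u00e6": 38, "\u00e9": 39, "\u00f8": 40, "\u00fc": 41}'

# The same entries as they appear in the new vocab.json (literal characters).
new_fragment = """{
  "|": 36,
  "å": 37,
  "æ": 38,
  "é": 39,
  "ø": 40,
  "ü": 41
}"""

old_vocab = json.loads(old_fragment)
new_vocab = json.loads(new_fragment)

# Both spellings decode to the same token-to-id mapping.
same_mapping = old_vocab == new_vocab

# Assumed way to obtain the new style: keep non-ASCII literal, indent entries.
pretty = json.dumps(old_vocab, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    print(same_mapping)
    print(pretty)
```

If the full files behave like these fragments, the tokenizer's mapping is unchanged by this commit, and only the on-disk formatting differs.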