jonatasgrosman committed ea576d0 (1 parent: 30c5623)
update README

README.md CHANGED
@@ -49,7 +49,7 @@ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

 LANG_ID = "fr"
 MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-french"
-SAMPLES =
+SAMPLES = 10

 test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

@@ -86,6 +86,11 @@ for i, predicted_sentence in enumerate(predicted_sentences):
 | "J'AI DIT QUE LES ACTEURS DE BOIS AVAIENT, SELON MOI, BEAUCOUP D'AVANTAGES SUR LES AUTRES." | JAI DIT QUE LES ACTEURS DE BOIS AVAIENT SELON MOI BEAUCOUP DAVANTAGES SUR LES AUTRES |
 | LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS. | LE PAYS-BAS AN REMPORTAIT TOUTES LES ÉDITIONS |
 | IL Y A MAINTENANT UNE GARE ROUTIÈRE. | IL A MA ANDIN GARD DETIRON |
+| HUIT | HUIT |
+| DANS L’ATTENTE DU LENDEMAIN, ILS NE POUVAIENT SE DÉFENDRE D’UNE VIVE ÉMOTION | DANS L'ATTENTE DU LENDEMAIN IL NE POUVAIT SE DÉFENDRE D'UNE VIVE ÉMOTION |
+| LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES. | LA PREMIÈRE SAISON EST COMPOSÉE DE DOUX ÉPISODES |
+| ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES. | ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES |
+| ZÉRO | ZÉRO |

 ## Evaluation

@@ -102,9 +107,11 @@ LANG_ID = "fr"
 MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-french"
 DEVICE = "cuda"

-CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
+CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
 "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
-"=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。"
+"{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
+"、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
+"『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

 test_dataset = load_dataset("common_voice", LANG_ID, split="test")

@@ -152,11 +159,15 @@ print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_

 **Test Result**:

-In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-
+In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-05-16). Note that the table below may show different results from those already reported, this may have been caused due to some specificity of the other evaluation scripts used.

 | Model | WER | CER |
 | ------------- | ------------- | ------------- |
 | jonatasgrosman/wav2vec2-large-xlsr-53-french | **16.86%** | **5.65%** |
-| Ilyes/wav2vec2-large-xlsr-53-french |
+| Ilyes/wav2vec2-large-xlsr-53-french | 19.67% | 6.70% |
+| jonatasgrosman/wav2vec2-large-fr-voxpopuli-french | 19.80% | 6.89% |
+| Nhut/wav2vec2-large-xlsr-french | 24.09% | 8.42% |
 | facebook/wav2vec2-large-xlsr-53-french | 25.45% | 10.35% |
 | MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-French | 28.22% | 9.70% |
+| Ilyes/wav2vec2-large-xlsr-53-french_punctuation | 29.80% | 11.79% |
+| facebook/wav2vec2-base-10k-voxpopuli-ft-fr | 61.06% | 33.31% |