patrickvonplaten committed
Commit f4ac233
1 Parent(s): f7e2aa3

Update README.md

Files changed (1):
  1. README.md +26 -61
README.md CHANGED
@@ -3,13 +3,17 @@ language: en
datasets:
- librispeech_asr
tags:
- - speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- - name: wav2vec2-large-960h-lv60
  results:
  - task:
      name: Automatic Speech Recognition
@@ -21,89 +25,50 @@ model-index:
    metrics:
    - name: Test WER
      type: wer
-       value: 1.9
---

- # Wav2Vec2-Large-960h-Lv60 + Self-Training + 4-gram
-
- [Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)
-
- The large model pretrained and fine-tuned on 960 hours of Libri-Light and Librispeech on 16kHz sampled speech audio. Model was trained with [Self-Training objective](https://arxiv.org/abs/2010.11430). When using the model make sure that your speech input is also sampled at 16Khz.
-
- [Paper](https://arxiv.org/abs/2006.11477)
-
- Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
-
- **Abstract**
-
- We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
-
- The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
-
-
- # Usage
-
- To transcribe audio files the model can be used as a standalone acoustic model as follows:
-
- ```python
- from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
- from datasets import load_dataset
- import torch
-
- # load model and processor
- processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
- model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
-
- # load dummy dataset and read soundfiles
- ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
-
- # tokenize
- input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values

- # retrieve logits
- logits = model(input_values).logits

- # take argmax and decode
- predicted_ids = torch.argmax(logits, dim=-1)
- transcription = processor.batch_decode(predicted_ids)
- ```
-
- ## Evaluation
-
- This code snippet shows how to evaluate **facebook/wav2vec2-large-960h-lv60-self** on LibriSpeech's "clean" and "other" test data.

```python
from datasets import load_dataset
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

- librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

- model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self").to("cuda")
- processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

def map_to_pred(batch):
-     inputs = processor(batch["audio"]["array"], return_tensors="pt", padding="longest")
-     input_values = inputs.input_values.to("cuda")
-     attention_mask = inputs.attention_mask.to("cuda")
-
    with torch.no_grad():
-         logits = model(input_values, attention_mask=attention_mask).logits

-     predicted_ids = torch.argmax(logits, dim=-1)
-     transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

- result = librispeech_eval.map(map_to_pred, remove_columns=["speech"])

- print("WER:", wer(result["text"], result["transcription"]))
```

*Result (WER)*:

| "clean" | "other" |
|---|---|
- | 1.9 | 3.9 |
 
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
+ widget:
+ - example_title: Librispeech sample 1
+   src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+ - example_title: Librispeech sample 2
+   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
+ - name: patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram
  results:
  - task:
      name: Automatic Speech Recognition

    metrics:
    - name: Test WER
      type: wer
+       value: 2.59
---

+ # Wav2Vec2-Large-960h-Lv60-Self + 4-gram

+ This model is identical to [Facebook's Wav2Vec2-Large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), but is
+ augmented with an English 4-gram language model. The `4-gram.arpa.gz` of [Librispeech's official ngrams](https://www.openslr.org/11) is used.
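
A checkpoint of this kind is typically assembled by pairing the acoustic model's tokenizer and feature extractor with a `pyctcdecode` beam-search decoder built on top of the KenLM 4-gram. The block below is a minimal sketch of that general recipe, not necessarily the exact script used for this repository; it assumes `4-gram.arpa` has already been downloaded and unpacked from the OpenSLR link above, and the output directory name is arbitrary.

```python
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

# start from the plain (LM-free) processor of the underlying acoustic model
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# order the tokenizer vocabulary by token id so the decoder labels line up with the CTC logits
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

# build a beam-search decoder around the KenLM 4-gram
# ("4-gram.arpa" is assumed to be the unpacked LibriSpeech LM from https://www.openslr.org/11)
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="4-gram.arpa")

# bundle feature extractor, tokenizer and decoder into a single processor with LM support
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("wav2vec2-large-960h-lv60-self-4-gram")
```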

+ ## Evaluation

+ This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram** on LibriSpeech's "clean" and "other" test data.

```python
from datasets import load_dataset
+ from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

+ model_id = "patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram"

+ librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

+ model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
+ processor = AutoProcessor.from_pretrained(model_id)

def map_to_pred(batch):
+     inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
+
+     inputs = {k: v.to("cuda") for k, v in inputs.items()}
+
    with torch.no_grad():
+         logits = model(**inputs).logits

+     transcription = processor.batch_decode(logits.cpu().numpy()).text[0]

    batch["transcription"] = transcription
    return batch

+ result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

+ print(wer(result["text"], result["transcription"]))
```
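
Loading this repository with `AutoProcessor` should return a `Wav2Vec2ProcessorWithLM`, so the `processor.batch_decode(logits.cpu().numpy())` call above runs a beam search over the CTC logits with the 4-gram rather than a plain argmax; this requires `pyctcdecode` and `kenlm` to be installed alongside `transformers`.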

*Result (WER)*:

| "clean" | "other" |
|---|---|
+ | 1.84 | 3.71 |
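
The updated card drops the old standalone transcription snippet. As a rough sketch (not part of this commit), the checkpoint can also be used through the `automatic-speech-recognition` pipeline, which applies the same LM-boosted decoding; the example reuses one of the widget sample URLs above and assumes a recent `transformers` with `pyctcdecode`, `kenlm`, and `ffmpeg` available.

```python
from transformers import pipeline

# load the LM-boosted checkpoint into an ASR pipeline (beam-search decoding with the
# 4-gram happens inside the pipeline when the processor ships a decoder)
asr = pipeline(
    "automatic-speech-recognition",
    model="patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram",
)

# the pipeline accepts local paths or URLs; ffmpeg is used to read and resample the audio
print(asr("https://cdn-media.huggingface.co/speech_samples/sample1.flac")["text"])
```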