---
license: apache-2.0
language: ta
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
- hf-asr-leaderboard
- tamil language
model-index:
- name: XLSR Wav2Vec2 Tamil by Manan Dey
  results:
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ta
      type: common_voice
      args: ta
    metrics:
       - name: Test WER
         type: wer
         value: 57.004356
---

# Wav2Vec2-Large-XLSR-Tamil

This is a fine-tuned XLSR Wav2Vec2 model for Tamil automatic speech recognition, trained on the Common Voice Tamil dataset. When using this model, make sure that your speech input is sampled at 16 kHz.
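
If your recordings use a different sampling rate, they can be resampled on load. A minimal sketch using `librosa` (also used in the examples below), where `speech.wav` is a placeholder path:

```python
import librosa

# librosa resamples the audio to 16 kHz while loading it.
# "speech.wav" is a placeholder path for your own recording.
speech_array, sampling_rate = librosa.load("speech.wav", sr=16_000)
```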

## Inference

The model can be used directly as follows:

```python
!pip install datasets
!pip install transformers
!pip install librosa

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

import torch
import librosa
from datasets import load_dataset

test_dataset = load_dataset("common_voice", "ta", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("Gobee/Wav2vec2-Large-XLSR-Tamil")
model = Wav2Vec2ForCTC.from_pretrained("Gobee/Wav2vec2-Large-XLSR-Tamil")

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
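
For quick experiments, the same checkpoint can also be wrapped in the `transformers` automatic-speech-recognition pipeline, which handles feature extraction and CTC decoding internally. A minimal sketch, where `sample.wav` is a placeholder path to a local recording:

```python
from transformers import pipeline

# Wrap the checkpoint in an ASR pipeline; it loads the processor and model internally.
asr = pipeline("automatic-speech-recognition", model="Gobee/Wav2vec2-Large-XLSR-Tamil")

# "sample.wav" is a placeholder path; the pipeline returns a dict with a "text" field.
print(asr("sample.wav")["text"])
```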


## Evaluation

The model can be evaluated as follows on the Tamil test data of Common Voice.


```python
!pip install datasets
!pip install transformers
!pip install librosa
!pip install jiwer

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

import torch
import librosa
from datasets import load_dataset, load_metric
import re

test_dataset = load_dataset("common_voice", "ta", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("Gobee/Wav2vec2-Large-XLSR-Tamil")
model = Wav2Vec2ForCTC.from_pretrained("Gobee/Wav2vec2-Large-XLSR-Tamil")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\’\–\(\)]'

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference on the test set in batches and decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
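
Newer releases of `datasets` no longer provide `load_metric`; in that case the word error rate can be computed directly with `jiwer` (installed above). A minimal sketch, reusing `result` from the block above:

```python
import jiwer

# jiwer.wer takes the reference sentences and the predicted strings.
print("WER: {:.2f}".format(100 * jiwer.wer(list(result["sentence"]), list(result["pred_strings"]))))
```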

**Test Result**: 57.004356 %

## Usage and Evaluation script

The script used for usage and evaluation can be found [here](https://colab.research.google.com/drive/1dyDe14iOmoNoVHDJTkg-hAgLnrGdI-Dk?usp=share_link).

## Training

The Common Voice `train` and `validation` datasets were used for training.
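
A minimal sketch of how those splits can be loaded and combined with `datasets`, assuming the same `common_voice` loading script used in the examples above:

```python
from datasets import load_dataset

# Combine the Tamil train and validation splits into one training set.
train_dataset = load_dataset("common_voice", "ta", split="train+validation")
```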

The script used for training can be found [here](https://colab.research.google.com/drive/1-Klkgr4f-C9SanHfVC5RhP0ELUH6TYlN?usp=sharing).