khanhld committed on
Commit
dde0efe
·
1 Parent(s): 75f5473

update readme

README.md CHANGED
@@ -3,6 +3,8 @@ language: vi
3
  datasets:
4
  - vivos
5
  - common_voice
 
 
6
  metrics:
7
  - wer
8
  pipeline_tag: automatic-speech-recognition
@@ -10,21 +12,17 @@ tags:
10
  - audio
11
  - speech
12
  - Transformer
 
 
13
  license: cc-by-nc-4.0
 
 
 
 
 
14
  model-index:
15
  - name: Wav2vec2 Base Vietnamese 160h
16
  results:
17
- - task:
18
- name: Speech Recognition
19
- type: automatic-speech-recognition
20
- dataset:
21
- name: Common Voice vi
22
- type: common_voice
23
- args: vi
24
- metrics:
25
- - name: Test WER
26
- type: wer
27
- value: 0
28
  - task:
29
  name: Speech Recognition
30
  type: automatic-speech-recognition
@@ -35,7 +33,7 @@ model-index:
35
  metrics:
36
  - name: Test WER
37
  type: wer
38
- value: 0
39
  - task:
40
  name: Speech Recognition
41
  type: automatic-speech-recognition
@@ -46,60 +44,107 @@ model-index:
46
  metrics:
47
  - name: Test WER
48
  type: wer
49
- value: 0
50
  ---
51
 
52
- # FINETUNE WAV2VEC 2.0 FOR SPEECH RECOGNITION
53
- ## Table of contents
54
- 1. [Documentation](#documentation)
55
- 2. [Installation](#installation)
56
- 3. [Usage](#usage)
57
- 4. [Logs and Visualization](#logs)
 
58
 
59
- <a name = "documentation" ></a>
60
- ## Documentation
61
- If you need a simple way to fine-tune the Wav2vec 2.0 model for speech recognition on your own datasets, you have come to the right place.
 
 
 
 
62
  </br>
63
- All documents related to this repo can be found here:
64
- - [Wav2vec2ForCTC](https://huggingface.co/docs/transformers/model_doc/wav2vec2#transformers.Wav2Vec2ForCTC)
65
- - [Tutorial](https://huggingface.co/blog/fine-tune-wav2vec2-english)
66
- - [Code reference](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py)
67
 
 
 
 
 
 
 
68
 
69
- <a name = "installation" ></a>
70
- ## Installation
71
- ```
72
- pip install -r requirements.txt
73
- ```
 
 
 
 
 
74
 
75
- <a name = "usage" ></a>
76
- ## Usage
77
- 1. Prepare your dataset
78
- - Your dataset can be in <b>.txt</b> or <b>.csv</b> format.
79
- The <b>path</b> and <b>transcript</b> columns are compulsory. The <b>path</b> column contains the paths to your stored audio files; depending on where your dataset is located, these can be absolute or relative paths. The <b>transcript</b> column contains the transcript corresponding to each audio path.
80
- - Check out our [data_example.csv](dataset/data_example.csv) file for more information.
81
- 2. Configure the config.toml file
82
- 3. Run
83
- - Start training:
84
- ```
85
- python train.py -c config.toml
86
- ```
87
- - Continue to train from resume:
88
- ```
89
- python train.py -c config.toml -r
90
- ```
91
- - Load specific model and start training:
92
- ```
93
- python train.py -c config.toml -p path/to/your/model.tar
94
- ```
95
-
96
- <a name = "logs" ></a>
97
- ## Logs and Visualization
98
- Logs are stored during training, and you can visualize them with TensorBoard by running this command:
99
  ```
100
- # specify the <name> in config.json
101
- tensorboard --logdir ~/saved/<name>
102
 
103
- # specify a port 8080
104
- tensorboard --logdir ~/saved/<name> --port 8080
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
105
  ```
 
 
 
 
 
 
 
 
 
 
 
3
  datasets:
4
  - vivos
5
  - common_voice
6
+ - fpt
7
+ - vlsp 100h
8
  metrics:
9
  - wer
10
  pipeline_tag: automatic-speech-recognition
 
12
  - audio
13
  - speech
14
  - Transformer
15
+ - wav2vec2
16
+ - automatic-speech-recognition
17
  license: cc-by-nc-4.0
18
+ widget:
19
+ - example_title: common_voice example
20
+ src: examples/common_voice_vi_30519757.mp3
21
+ - example_title: vivos example
22
+ src: examples/VIVOSDEV02_R005.wav
23
  model-index:
24
  - name: Wav2vec2 Base Vietnamese 160h
25
  results:
 
 
 
 
 
 
 
 
 
 
 
26
  - task:
27
  name: Speech Recognition
28
  type: automatic-speech-recognition
 
33
  metrics:
34
  - name: Test WER
35
  type: wer
36
+ value: 10.78
37
  - task:
38
  name: Speech Recognition
39
  type: automatic-speech-recognition
 
44
  metrics:
45
  - name: Test WER
46
  type: wer
47
+ value: 15.05
48
  ---
49
 
50
+ # Vietnamese Speech Recognition using Wav2vec 2.0
51
+ ### Table of contents
52
+ 1. [Model Description](#description)
53
+ 2. [Benchmark Result](#benchmark)
54
+ 3. [Example Usage](#example)
55
+ 4. [Evaluation](#evaluation)
56
+ 5. [Contact](#contact)
57
 
58
+ <a name = "description" ></a>
59
+ ### Model Description
60
+ A Wav2vec2-based model fine-tuned on about 160 hours of Vietnamese speech drawn from several sources: [VIVOS](https://huggingface.co/datasets/vivos), [COMMON VOICE](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0), [FPT](https://data.mendeley.com/datasets/k9sxg2twv4/4) and [VLSP 100h](https://drive.google.com/file/d/1vUSxdORDxk-ePUt-bUVDahpoXiqKchMx/view). The ASR system does not yet include a language model (planned as future work), but it already achieves promising results.
61
+ <br>
62
+ Code for pre-training and fine-tuning the Wav2vec2 model will also be provided (it is not available yet but will be released soon). If you wish to train on your own dataset, check it out here:
63
+ 1. [Pretrain](https://github.com/khanld/ASR-Wav2vec-Pretrain)
64
+ 2. [Finetune](https://github.com/khanld/ASR-Wa2vec-Finetune)
65
  </br>
 
 
 
 
66
 
67
+ <a name = "benchmark" ></a>
68
+ ### Benchmark WER Result
69
+ | | [VIVOS](https://huggingface.co/datasets/vivos) | [COMMON VOICE 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) |
70
+ |---|---|---|
71
+ |without LM| 15.05 | 10.78 |
72
+ |with LM| in progress | in progress |
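
For reference, WER is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words; the figures above are that ratio expressed as a percentage. A minimal sketch of the per-utterance computation, assuming whitespace tokenization (the function name `word_error_rate` is illustrative and not part of this repository):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    score = word_error_rate("tôi đang học tiếng việt", "tôi đang học tiếng anh")
    print(f"WER: {100 * score:.2f}%")
```

One substituted word in a five-word reference gives a ratio of 0.2, which would be reported as 20%.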
73
 
74
+ <a name = "example" ></a>
75
+ ### Example Usage
+ ```python
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+ import librosa
+ import torch
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # load processor and model
+ processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
+ model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
+ model.to(device)
+ model.eval()
+
+ def transcribe(wav):
+     # wav: 1-D float array sampled at 16 kHz
+     input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
+     with torch.no_grad():
+         logits = model(input_values.to(device)).logits
+     pred_ids = torch.argmax(logits, dim=-1)  # most likely token per frame
+     pred_transcript = processor.batch_decode(pred_ids)[0]
+     return pred_transcript
+
+ # librosa resamples the audio to 16 kHz on load
+ wav, _ = librosa.load('path/to/your/audio/file', sr=16000)
+ print(f"transcript: {transcribe(wav)}")
  ```
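
The `torch.argmax` and `processor.batch_decode` steps in the example amount to greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. A toy sketch of that collapse rule, assuming a hypothetical five-token vocabulary with id 0 as the blank and `|` as the word delimiter (the real tokenizer's vocabulary and ids differ):

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only.
ID_TO_TOKEN = {0: "<pad>", 1: "|", 2: "x", 3: "i", 4: "n"}


def ctc_greedy_decode(pred_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text using the CTC rule."""
    # 1. Merge runs of identical consecutive ids.
    collapsed = [token_id for token_id, _ in groupby(pred_ids)]
    # 2. Drop blanks; a blank between two equal ids keeps them as two characters.
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Turn word delimiters into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


if __name__ == "__main__":
    frames = [2, 2, 0, 3, 3, 0, 4, 1, 1, 0, 2, 0, 3, 4, 4]
    print(ctc_greedy_decode(frames, ID_TO_TOKEN))
```

Because no language model is involved, each frame is decoded independently of the words around it.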
 
 
96
 
97
+ <a name = "evaluation"></a>
98
+ ### Evaluation
+ ```python
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+ from datasets import load_dataset, load_metric, Audio
+ import torch
+ import re
+
+ # newer releases of datasets drop load_metric; evaluate.load("wer") replaces it there
+ wer = load_metric("wer")
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # load processor and model
+ processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
+ model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
+ model.to(device)
+ model.eval()
+
+ # load dataset and resample audio to 16 kHz
+ test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test")
+ test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
+ chars_to_ignore = r'[,?.!\-;:"“%\'�]' # ignore special characters
+
+ # preprocess data: keep the raw waveform and normalize the reference transcript
+ def preprocess(batch):
+     audio = batch["audio"]
+     batch["input_values"] = audio["array"]
+     batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
+     return batch
+
+ # run inference
+ def inference(batch):
+     input_values = processor(batch["input_values"],
+                              sampling_rate=16000,
+                              return_tensors="pt").input_values
+     with torch.no_grad():
+         logits = model(input_values.to(device)).logits
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_transcript"] = processor.batch_decode(pred_ids)
+     return batch
+
+ test_dataset = test_dataset.map(preprocess)
+ result = test_dataset.map(inference, batched=True, batch_size=1)
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))
  ```
141
+ **Test Result**: 10.78%
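
The reference transcripts are normalized before scoring: the characters listed in `chars_to_ignore` are removed and the text is lowercased, so punctuation and casing differences are not counted as errors. A small sketch of that step in isolation (the helper name `normalize_transcript` is illustrative):

```python
import re

# Same character class as chars_to_ignore in the evaluation script.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%\'�]'


def normalize_transcript(sentence: str) -> str:
    """Strip the ignored punctuation characters and lowercase the text."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


if __name__ == "__main__":
    print(normalize_transcript("Xin chào, Việt Nam!"))
```

Scoring against un-normalized references would likely give a higher WER than the figure reported here, since the example code applies no punctuation or casing restoration to the predictions.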
142
+
143
+ <a name = "contact"></a>
144
+ ### Contact
145
146
+ </br>
147
+ [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/)<br>
148
+ [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)
149
+
150
+
examples/VIVOSDEV02_R005.wav ADDED
Binary file (84 kB).
 
examples/common_voice_vi_30519757.mp3 ADDED
Binary file (27.7 kB).