NhutP committed
Commit bc666ab
1 parent: b6175ef

Update README.md

Files changed (1)
  1. README.md +79 -3
README.md CHANGED
@@ -1,3 +1,79 @@
- ---
- license: mit
- ---
+ ---
+ library_name: transformers
+ license: mit
+ datasets:
+ - NhutP/VSV-1100
+ - mozilla-foundation/common_voice_14_0
+ - AILAB-VNUHCM/vivos
+ language:
+ - vi
+ metrics:
+ - wer
+ base_model:
+ - openai/whisper-medium
+ ---
+ ## Introduction
+ - We release a new model for the Vietnamese speech recognition task.
+ - We fine-tuned [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on our new dataset [VSV-1100](https://huggingface.co/datasets/NhutP/VSV-1100).
+
+ ## Training data
+
+ | [VSV-1100](https://huggingface.co/datasets/NhutP/VSV-1100) | T2S* | [CMV14-vi](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0) | [VIVOS](https://huggingface.co/datasets/AILAB-VNUHCM/vivos) | [VLSP2021](https://vlsp.org.vn/index.php/resources) | Total |
+ |:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+ | 1100 hours | 11 hours | 3.04 hours | 13.94 hours | 180 hours | 1308 hours |
+
+ \* We use a text-to-speech model to generate sentences containing words that do not appear in our dataset.
+
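As a quick sanity check on the table above, the hours per source can be summed and expressed as shares of the total. This is only a minimal sketch: the dictionary keys are labels copied from the table, and the variable names are illustrative.

```python
# Hours per source, copied from the training data table above.
hours = {
    "VSV-1100": 1100.0,
    "T2S": 11.0,
    "CMV14-vi": 3.04,
    "VIVOS": 13.94,
    "VLSP2021": 180.0,
}

# Sum of all sources; the table reports this rounded to 1308 hours.
total_hours = sum(hours.values())

# Share of each source in the training mix, as a percentage.
shares = {name: 100.0 * h / total_hours for name, h in hours.items()}

for name, share in shares.items():
    print(f"{name}: {hours[name]:g} h ({share:.1f}%)")
print(f"Total: {total_hours:.2f} h")
```

Under these numbers, VSV-1100 makes up roughly 84% of the training audio.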
+ ## WER results (%)
+ | [CMV14-vi](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0) | [VIVOS](https://huggingface.co/datasets/AILAB-VNUHCM/vivos) | [VLSP2020-T1](https://vlsp.org.vn/index.php/resources) | [VLSP2020-T2](https://vlsp.org.vn/index.php/resources) | [VLSP2021-T1](https://vlsp.org.vn/index.php/resources) | [VLSP2021-T2](https://vlsp.org.vn/index.php/resources) | [Bud500](https://huggingface.co/datasets/linhtran92/viet_bud500) |
+ |:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+ | 8.1 | 4.69 | 13.22 | 28.76 | 11.78 | 8.28 | 5.38 |
+
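The figures in the table above are word error rates. The scoring script and text normalization behind them are not described here, so the following is only a minimal sketch of the standard definition, WER = (S + D + I) / N, computed with a word-level edit distance. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example: one substituted word out of four reference words.
reference = "xin chào các bạn"
hypothesis = "xin chào cả bạn"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")
```

In practice, reference and hypothesis are usually normalized the same way (casing, punctuation) before scoring, and that choice can shift the reported numbers.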
+ ## Usage
+ ### Inference
+ ```python
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
+ import librosa
+
+ # load model and processor
+ processor = WhisperProcessor.from_pretrained("NhutP/ViWhisper-medium")
+ model = WhisperForConditionalGeneration.from_pretrained("NhutP/ViWhisper-medium")
+ model.config.forced_decoder_ids = None
+
+ # load an audio file, resampled to 16 kHz
+ array, sampling_rate = librosa.load('path_to_audio', sr=16000)
+ input_features = processor(array, sampling_rate=sampling_rate, return_tensors="pt").input_features
+ # generate token ids
+ predicted_ids = model.generate(input_features)
+ # decode token ids to text
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+ ```
+ ### Use with pipeline
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model="NhutP/ViWhisper-medium",
+     max_new_tokens=128,
+     chunk_length_s=30,
+     return_timestamps=False,
+     device='...'  # 'cpu' or 'cuda'
+ )
+ # path to an audio file sampled at 16 kHz
+ path_to_audio_samplingrate_16000 = 'path_to_audio'
+ output = pipe(path_to_audio_samplingrate_16000)['text']
+ ```
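The `chunk_length_s=30` argument above asks the pipeline to process long recordings in 30-second windows. The pipeline's own chunking overlaps neighbouring windows and merges their outputs; the sketch below is a simplified, non-overlapping illustration of how a chunk length in seconds maps to sample indices at 16 kHz. `chunk_bounds` is a hypothetical helper written for this example, not a transformers function.

```python
def chunk_bounds(num_samples: int, sampling_rate: int = 16000, chunk_length_s: float = 30.0):
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    chunk_size = int(sampling_rate * chunk_length_s)
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 75-second recording at 16 kHz: two full 30 s chunks and one 15 s remainder.
bounds = chunk_bounds(num_samples=75 * 16000)
for start, end in bounds:
    print(f"samples {start}..{end} ({(end - start) / 16000:.0f} s)")
```

Hard cuts like these can split a word in two, which is one reason an overlapping scheme is preferable for real transcription.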
+
+ ## Citation
+
+ ```
+ @misc{VSV-1100,
+   author = {Pham Quang Nhut and Duong Pham Hoang Anh and Nguyen Vinh Tiep},
+   title = {VSV-1100: Vietnamese social voice dataset},
+   url = {https://github.com/NhutP/VSV-1100},
+   year = {2024}
+ }
+ ```
+
+ Also, please give us a star on GitHub if you find our project useful: https://github.com/NhutP/ViWhisper
+
+ Contact me at: [email protected] (Pham Quang Nhut)