|
--- |
|
library_name: transformers |
|
license: mit |
|
datasets: |
|
- NhutP/VSV-1100 |
|
- mozilla-foundation/common_voice_14_0 |
|
- AILAB-VNUHCM/vivos |
|
language: |
|
- vi |
|
metrics: |
|
- wer |
|
base_model: |
|
- openai/whisper-medium |
|
--- |
|
## Introduction |
|
- We release a new model for the Vietnamese speech recognition task.
|
- We fine-tuned [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on our new dataset [VSV-1100](https://huggingface.co/datasets/NhutP/VSV-1100). |
|
|
|
## Training data |
|
|
|
| [VSV-1100](https://huggingface.co/datasets/NhutP/VSV-1100) | T2S* | [CMV14-vi](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0) |[VIVOS](https://huggingface.co/datasets/AILAB-VNUHCM/vivos)| [VLSP2021](https://vlsp.org.vn/index.php/resources) | Total| |
|
|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| |
|
| 1100 hours | 11 hours | 3.04 hours | 13.94 hours| 180 hours | 1308 hours | |
|
|
|
\* We use a text-to-speech model to generate audio for sentences containing words that do not appear in our dataset.
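As a quick sanity check, the per-source durations in the table above sum to the reported total (the figures below are taken directly from the table):

```python
# Durations (in hours) of each training source, from the table above
durations = {
    "VSV-1100": 1100,
    "T2S": 11,
    "CMV14-vi": 3.04,
    "VIVOS": 13.94,
    "VLSP2021": 180,
}

total_hours = sum(durations.values())
print(round(total_hours))  # 1308, matching the "Total" column
```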
|
|
|
## WER results (%)
|
| [CMV14-vi](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0) | [VIVOS](https://huggingface.co/datasets/AILAB-VNUHCM/vivos) | [VLSP2020-T1](https://vlsp.org.vn/index.php/resources) | [VLSP2020-T2](https://vlsp.org.vn/index.php/resources) | [VLSP2021-T1](https://vlsp.org.vn/index.php/resources) | [VLSP2021-T2](https://vlsp.org.vn/index.php/resources) |[Bud500](https://huggingface.co/datasets/linhtran92/viet_bud500) | |
|
|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| |
|
|8.1|4.69|13.22|28.76| 11.78 | 8.28 | 5.38 | |
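For reference, WER is the word-level edit distance between a hypothesis and a reference transcript, normalized by the reference length. A minimal pure-Python sketch is shown below for illustration (standard tooling such as `jiwer` or Hugging Face `evaluate` is normally used for benchmark numbers like those above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("xin chào các bạn", "xin chào bạn"))  # 0.25 (one deletion over four reference words)
```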
|
|
|
|
|
|
|
## Usage |
|
### Inference |
|
```python |
|
from transformers import WhisperProcessor, WhisperForConditionalGeneration |
|
import librosa |
|
# load model and processor |
|
processor = WhisperProcessor.from_pretrained("NhutP/ViWhisper-medium") |
|
model = WhisperForConditionalGeneration.from_pretrained("NhutP/ViWhisper-medium") |
|
model.config.forced_decoder_ids = None |
|
|
|
# load a sample |
|
array, sampling_rate = librosa.load('path_to_audio', sr=16000)  # load an audio sample, resampled to 16 kHz
|
input_features = processor(array, sampling_rate=sampling_rate, return_tensors="pt").input_features |
|
# generate token ids |
|
predicted_ids = model.generate(input_features) |
|
# decode token ids to text |
|
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) |
|
``` |
|
### Use with pipeline |
|
```python |
|
from transformers import pipeline |
|
pipe = pipeline( |
|
"automatic-speech-recognition", |
|
model="NhutP/ViWhisper-medium", |
|
max_new_tokens=128, |
|
chunk_length_s=30, |
|
return_timestamps=False, |
|
    device='...',  # 'cpu' or 'cuda'
|
) |
|
output = pipe('path_to_audio')['text']  # path to an audio file sampled at 16 kHz
|
``` |
|
|
|
## Citation |
|
|
|
``` |
|
@misc{VSV-1100, |
|
author = {Pham Quang Nhut and Duong Pham Hoang Anh and Nguyen Vinh Tiep}, |
|
title = {VSV-1100: Vietnamese social voice dataset}, |
|
url = {https://github.com/NhutP/VSV-1100}, |
|
year = {2024} |
|
} |
|
``` |
|
|
|
Also, please give us a star on GitHub if you find our project useful: https://github.com/NhutP/ViWhisper
|
|
|
Contact me at: [email protected] (Pham Quang Nhut) |