---
language: vi
datasets:
- VLSP 2020 ASR dataset
- VIVOS
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
widget:
- label: VLSP ASR 2020 test T1
  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_0001-00010.wav
- label: VLSP ASR 2020 test T1
  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_utt000000042.wav
- label: VLSP ASR 2020 test T2
  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav
---
# Wav2Vec2-Base-250h for the Vietnamese language
[Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

The base model was pretrained and fine-tuned on 250 hours of the VLSP ASR dataset, using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
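Because the model expects 16 kHz input, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch using only the Python standard library; the helper names `get_wav_sample_rate` and `needs_resampling` are hypothetical and not part of this repository, and the sketch assumes uncompressed PCM WAV files.

```python
import wave

# Sample rate the model card says the model expects (16 kHz).
TARGET_SAMPLE_RATE = 16_000


def get_wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return get_wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns `True`, the audio should be resampled to 16 kHz with an audio library of your choice before it is passed to the processor.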
# Usage
To transcribe audio files the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# load model and processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# define function to read in sound file
def map_to_array(batch):
    # sf.read returns (samples, sample_rate); the audio must already be 16 kHz
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# read the sample sound file
ds = map_to_array({
    "file": 'audio-test/t1_0001-00010.wav'
})

# convert the raw waveform into model inputs
input_values = processor(ds["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
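The "take argmax and decode" step above is greedy CTC decoding: the most likely token is picked for each audio frame, consecutive repeats are merged, and blank tokens are dropped. The sketch below illustrates that collapsing rule in plain Python. It is an illustration of the general technique, not the library's actual implementation; the function name `ctc_greedy_collapse` and the default `blank_id=0` are assumptions, since the real blank (padding) token id depends on the tokenizer vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame token ids into a label sequence (greedy CTC).

    blank_id=0 is a hypothetical value; the real id comes from the tokenizer.
    """
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Two runs of token 3 separated by a blank stay as two separate labels.
print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

A blank between two identical tokens is what allows the model to emit a genuinely doubled character, which is why repeats are merged before blanks are removed.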
*Result WER (with 4-grams LM)*:
| "VIVOS" | "VLSP-T1" | "VLSP-T2" |
|---|---|---|
| 6.1 | 9.1 | 40.8 |
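
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic pure-Python implementation for illustration only; it is not the evaluation script used to produce the table, it does not involve a language model, and the function name `word_error_rate` is hypothetical. The table values are presumably percentages, so a returned value of 0.25 would correspond to 25.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("xin chào các bạn", "xin chào bạn"))  # 0.25
```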
# License
This model is released under the [CC-BY-NC-4.0](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/CC-BY-NC-SA-4.0.txt) license. It is therefore freely available for academic purposes and individual research, but commercial use is restricted.
# Contact
[email protected]