---
language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: x
---
# ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition

## Table of contents

1. [Model Description](#model-description)
2. [Documentation and Implementation](#documentation-and-implementation)
3. [Benchmark Results](#benchmark-results)
4. [Quick Usage](#quick-usage)
5. [Citation](#citation)
## Model Description

ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on approximately 2000 hours of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found HERE.

> **Note:** only the [train-subset] was used for tuning the model.
## Documentation and Implementation
The Documentation and Implementation of ChunkFormer are publicly available.
## Benchmark Results

| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|---|---|---|---|---|---|---|
| 1 | ChunkFormer | 110M | x | x | x | x |
| 2 | PhoWhisper | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | nguyenvulebinh/wav2vec2-base-vietnamese-250h | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | khanhld/wav2vec2-base-vietnamese-160h | 95M | 15.05 | 10.78 | x | x |
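The WER values above are word-level edit distance divided by the number of reference words. As a reference point, a minimal pure-Python sketch of the metric (the function name and text normalization are illustrative, not taken from the repo's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

In practice, published WER numbers also depend on text normalization (casing, punctuation, numerals), so exact reproduction requires the same preprocessing as the benchmark.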
## Quick Usage

To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer repository:**

```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```

2. **Download the model checkpoint from Hugging Face:**

```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```

This clones the model checkpoint into a `chunkformer-large-vie` folder inside your `chunkformer` directory.
3. **Run the model:**

```bash
python decode.py \
    --model_checkpoint path/to/chunkformer-large-vie \
    --long_form_audio path/to/long_audio.wav \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
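ASR checkpoints of this kind typically expect 16 kHz mono WAV input (an assumption here; check the repo's documentation for the exact requirement). Below is a small stdlib-only sketch for verifying a file before decoding; `check_audio` is a hypothetical helper, not part of the repo:

```python
import struct
import wave

def check_audio(path: str, expected_rate: int = 16000) -> bool:
    """Return True if the WAV file is mono at the expected sample rate."""
    with wave.open(path, "rb") as w:
        return w.getnchannels() == 1 and w.getframerate() == expected_rate

# Demo: write a 10 ms silent 16 kHz mono WAV and verify it passes the check.
with wave.open("probe.wav", "wb") as w:
    w.setnchannels(1)          # mono
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(16000)      # 16 kHz
    w.writeframes(struct.pack("<h", 0) * 160)  # 160 frames = 10 ms

assert check_audio("probe.wav")
```

Files at other sample rates or with multiple channels would need resampling/downmixing (e.g. with ffmpeg) before being passed to `decode.py`.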
## Citation

If you use this work in your research, please cite:

```bibtex
@inproceedings{your_paper,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```