---
language: vie
datasets:
  - legacy-datasets/common_voice
  - vlsp2020_vinai_100h
  - AILAB-VNUHCM/vivos
  - doof-ferb/vlsp2020_vinai_100h
  - doof-ferb/fpt_fosd
  - doof-ferb/infore1_25hours
  - linhtran92/viet_bud500
  - doof-ferb/LSVSC
  - doof-ferb/vais1000
  - doof-ferb/VietMed_labeled
  - NhutP/VSV-1100
  - doof-ferb/Speech-MASSIVE_vie
  - doof-ferb/BibleMMS_vie
  - capleaf/viVoice
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
tags:
  - transcription
  - audio
  - speech
  - chunkformer
  - asr
  - automatic-speech-recognition
  - long-form transcription
license: cc-by-nc-4.0
model-index:
  - name: ChunkFormer Large Vietnamese
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: common-voice-vietnamese
          type: common_voice
          args: vi
        metrics:
          - name: Test WER
            type: wer
            value: x
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: VIVOS
          type: vivos
          args: vi
        metrics:
          - name: Test WER
            type: wer
            value: x
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: VLSP - Task 1
          type: vlsp
          args: vi
        metrics:
          - name: Test WER
            type: wer
            value: x
---

# ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition

License: CC BY-NC 4.0 · [GitHub](https://github.com/khanld/chunkformer) · Paper


## Table of Contents

  1. Model Description
  2. Documentation and Implementation
  3. Benchmark Results
  4. Usage
  5. Citation
  6. Contact

## Model Description

ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model was fine-tuned on approximately 2,000 hours of public Vietnamese speech drawn from the diverse datasets listed in the metadata above.

**Note:** only the train subsets of these datasets were used for fine-tuning.


## Documentation and Implementation

The documentation and implementation of ChunkFormer are publicly available in the GitHub repository linked above.


## Benchmark Results

| STT | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------|---------|-------|--------------|---------------|------|
| 1 | ChunkFormer | 110M | x | x | x | x |
| 2 | PhoWhisper | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | nguyenvulebinh/wav2vec2-base-vietnamese-250h | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | khanhld/wav2vec2-base-vietnamese-160h | 95M | 15.05 | 10.78 | x | x |

All values are test-set WER (%).
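As a quick sanity check on the table, the Avg. column for the fully benchmarked rows is the unweighted mean of the three per-dataset test WERs:

```python
# Recompute the Avg. column as the unweighted mean of the three
# test WERs (Vivos, Common Voice, VLSP - Task 1), in %.
wer = {
    "PhoWhisper": [4.67, 8.14, 13.75],
    "wav2vec2-base-vietnamese-250h": [10.77, 18.34, 13.33],
}
avg = {model: round(sum(scores) / len(scores), 2) for model, scores in wer.items()}
print(avg)  # {'PhoWhisper': 8.85, 'wav2vec2-base-vietnamese-250h': 14.15}
```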

## Quick Usage

To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. Download the ChunkFormer repository and install its dependencies:

```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. Download the model checkpoint from Hugging Face:

```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```

This clones the model checkpoint into the `chunkformer-large-vie` folder inside your `chunkformer` directory.

3. Run the model:

```bash
python decode.py \
    --model_checkpoint path/to/chunkformer-large-vie \
    --long_form_audio path/to/long_audio.wav \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
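The three context flags control how much of the signal each chunk can attend to under ChunkFormer's masked chunk attention. A minimal sketch of that arithmetic, assuming all three flags are measured in encoder frames (`attention_context_frames` is a hypothetical helper for illustration, not part of `decode.py`):

```python
# Hypothetical helper, not part of decode.py: illustrates how many
# encoder frames one chunk can attend to under masked chunk attention,
# assuming --chunk_size, --left_context_size and --right_context_size
# are all counted in encoder frames.
def attention_context_frames(chunk_size: int, left: int, right: int) -> int:
    """Frames visible to one chunk: its own frames plus both contexts."""
    return left + chunk_size + right

# With the flag values from the command above:
print(attention_context_frames(64, 128, 128))  # 320
```

Larger context sizes widen each chunk's attention window (usually lowering WER) at the cost of more memory and compute per chunk.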

## Citation

If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

## Contact