---

language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
       - name: Test WER
         type: wer
         value: x
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
       - name: Test WER
         type: wer
         value: x
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
       - name: Test WER
         type: wer
         value: x
---


# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://your-paper-link)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on approximately **2,000 hours** of public Vietnamese speech data drawn from diverse datasets. The full list of datasets is available [**here**](dataset.tsv).

**Note: only the \[train-subset\] was used to tune the model.**

---
<a name = "implementation" ></a>
## Documentation and Implementation
The [Documentation](#) and [Implementation](#) of ChunkFormer are publicly available.

---
<a name = "benchmark" ></a>
## Benchmark Results
Scores are word error rates (WER, %) on each test set; lower is better. An `x` marks a result that is not reported.

| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------|---------|-------|--------------|---------------|------|
| 1   | ChunkFormer | 110M | x | x | x | x |
| 2   | [PhoWhisper](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3   | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4   | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | x | x |
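
The Avg. values of the two fully reported baselines are consistent with an unweighted mean of the three test-set WERs. The sketch below is not part of the ChunkFormer repository. It recomputes those averages and shows one common way to compute a word-level WER via edit distance. The function name `wer` and the `reported` dictionary are illustrative assumptions, and the evaluation script behind the table may normalize text differently.

```python
from statistics import mean


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the previous reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match when equal)
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# WERs copied from the table: (Vivos, Common Voice, VLSP - Task 1)
reported = {
    "PhoWhisper": (4.67, 8.14, 13.75),
    "wav2vec2-base-vietnamese-250h": (10.77, 18.34, 13.33),
}
averages = {name: round(mean(scores), 2) for name, scores in reported.items()}
print(averages)
```

Rows with an unreported score are left out of `reported`, since a mean over a partial set of test sets would not be comparable.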

---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```
This downloads the model checkpoint into a `chunkformer-large-vie` folder in the current directory.

3. **Run the model**
```bash
python decode.py \
    --model_checkpoint path/to/chunkformer-large-vie \
    --long_form_audio path/to/long_audio.wav \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
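
To transcribe several recordings, the same command can be driven from Python. The following is a minimal sketch, assuming it is run from inside the cloned `chunkformer` directory. `build_decode_command` and `transcribe_all` are hypothetical helpers, not part of the repository, and they pass only the flags shown in the command for this step.

```python
import subprocess
import sys
from pathlib import Path


def build_decode_command(checkpoint, audio, chunk_size=64,
                         left_context=128, right_context=128):
    """Build the decode.py argument list (hypothetical helper)."""
    return [
        sys.executable, "decode.py",
        "--model_checkpoint", str(checkpoint),
        "--long_form_audio", str(audio),
        "--chunk_size", str(chunk_size),
        "--left_context_size", str(left_context),
        "--right_context_size", str(right_context),
    ]


def transcribe_all(checkpoint, audio_dir):
    """Run decode.py once per .wav file in audio_dir (hypothetical helper)."""
    for audio in sorted(Path(audio_dir).glob("*.wav")):
        # check=True raises CalledProcessError if decode.py exits non-zero
        subprocess.run(build_decode_command(checkpoint, audio), check=True)


# Example argument list; the paths are the placeholders used in this README
example = build_decode_command("path/to/chunkformer-large-vie",
                               "path/to/long_audio.wav")
print(example)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.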

---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{your_paper,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

---
<a name = "contact"></a>
## Contact
- [email protected]
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)