---

language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
       - name: Test WER
         type: wer
         value: x
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
       - name: Test WER
         type: wer
         value: x
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
       - name: Test WER
         type: wer
         value: x
---


# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://your-paper-link)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model has been fine-tuned on approximately **3000 hours** of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found [**HERE**](dataset.tsv). 

**Please note that only the train subsets of these datasets were used for fine-tuning the model.**

---
<a name = "implementation" ></a>
## Documentation and Implementation
The documentation and implementation of ChunkFormer are publicly available in the [GitHub repository](https://github.com/khanld/chunkformer).

---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure a consistent and fair comparison, we manually apply **text normalization**, including the handling of numbers, uppercase letters, and punctuation.
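The exact normalization pipeline is not reproduced here, but the idea can be sketched in a few lines: lowercase the text, strip punctuation, and collapse whitespace before scoring. This is a simplified illustration, not the normalization used for the numbers below.

```python
import re
import string

def normalize_text(s: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    s = s.lower()
    s = s.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", s).strip()

print(normalize_text("Xin chào, Việt Nam!"))  # -> "xin chào việt nam"
```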

1. **Public Models**:

| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------|---------|-------|--------------|---------------|------|
| 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |

2. **Private Models (API)**:

| No. | Model | VLSP - Task 1 |
|-----|-------|---------------|
| 1 | **ChunkFormer** | **13.9** |
| 2 | Viettel | 14.5 |
| 3 | Google | 19.5 |
| 4 | FPT | 28.8 |
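For reference, WER is the word-level edit distance between a hypothesis and a reference, divided by the reference length. A minimal self-contained sketch (evaluation toolkits such as `jiwer` provide the same metric with more features):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("xin chào việt nam", "xin chào viet nam"))  # 1 substitution / 4 words = 0.25
```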

---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```
This clones the model checkpoint into the `chunkformer-large-vie` folder inside your `chunkformer` directory.

3. **Run the model**
```bash
# --max_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --max_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
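Conceptually, `--chunk_size`, `--left_context_size`, and `--right_context_size` control how a long recording is split into chunks that each attend to a window of neighboring frames. The following is an illustrative index-bookkeeping sketch of that idea, not ChunkFormer's actual masked-chunking implementation:

```python
def split_into_chunks(num_frames, chunk_size, left, right):
    """Return (ctx_start, chunk_start, chunk_end, ctx_end) tuples: each chunk
    plus the left/right context frames it may attend to."""
    chunks = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        ctx_start = max(0, start - left)          # clip left context at 0
        ctx_end = min(num_frames, end + right)    # clip right context at the end
        chunks.append((ctx_start, start, end, ctx_end))
    return chunks

for c in split_into_chunks(200, chunk_size=64, left=128, right=128):
    print(c)
```

With these settings, the first chunk has no left context available, the last one no right context, and interior chunks see up to 128 frames on each side.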

---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

---
<a name = "contact"></a>
## Contact
- [email protected]
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)