File size: 5,303 Bytes
1806b43
 
 
 
 
 
 
6e6a7e8
1806b43
 
 
 
 
 
 
 
 
c8c2a58
 
 
1806b43
c8c2a58
1806b43
b560801
d7d406e
8034c9c
1806b43
 
 
 
 
 
2739695
1806b43
 
c8c2a58
2e5181f
 
 
c8c2a58
 
 
 
8034c9c
c8c2a58
 
b14b2d8
c8c2a58
d7d406e
c8c2a58
 
 
8034c9c
 
 
c8c2a58
 
 
 
 
 
 
72b901d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c8c2a58
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d7d406e
97aa366
c8c2a58
d7d406e
c8c2a58
 
d7d406e
 
 
c8c2a58
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
---
language:
- zh
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
base_model: openai/whisper-small
datasets:
- mozilla-foundation/common_voice_11_0
model-index:
- name: Whisper Small zh-HK - Alvin
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_16_0 yue
      type: mozilla-foundation/common_voice_16_0
      config: yue
      split: test
      args: yue
    metrics:
    - name: Normalized CER
      type: cer
      value: 7.93
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Whisper Small zh-HK - Alvin

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Cantonese language. It achieves a 8.94 CER (without punctuations), 10.73 CER (with punctuations) on Common Voice 16.0

## Training and evaluation data
For training,
- CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
- Cantonse-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset", 2022. Link: https://arxiv.org/pdf/2201.02419.pdf

|Name|# of Hours|
|--|--|
|Common Voice 16.0 zh-HK Train|138|
|Common Voice 16.0 yue Train|85|
|Common Voice 17.0 yue Train|178|
|Cantonese-ASR|72|
|CantoMap|23|
|[Pseudo-Labelled YouTube Data](https://huggingface.co/datasets/alvanlii/cantonese-youtube-pseudo-transcription)|438|


For evaluation, Common Voice 16.0 yue Test set is used.

## Results
- CER (lower is better): 0.0972
  - down from 0.1073, 0.1581 in the previous versions
- CER (punctuations removed): 0.0793
- GPU Inference with Fast Attention (example below): 0.055s/sample
  - Note all GPU evaluations are done on RTX 3090 GPU
- GPU Inference: 0.308s/sample
- CPU Inference: 2.57s/sample
- GPU VRAM: ~1.5 GB


## Using the Model
```
import librosa

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features, 
    output_scores=True, return_dict_in_generate=True
)
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)
```
- Alternatively, you can use huggingface pipelines
```
from transformers import pipeline
MODEL_NAME = "alvanlii/whisper-small-cantonese" 
lang = "zh"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]
```

## Model Speedup
Just add attn_implementation="sdpa" for Flash Attention. 
```
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "alvanlii/whisper-small-cantonese",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
```
Using Flash Attention reduced the amount of time taken per sample from 0.308s to 0.055s.

## Speculative Decoding
You can use a bigger model, then use `alvanlii/whisper-small-cantonese` to speed up inference with basically no loss in accuracy.
```
model_id = "simonl0909/whisper-large-v2-cantonese"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

assistant_model_id = "alvanlii/whisper-small-cantonese"

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

assistant_model.to(device)
...
model.generate(**inputs, use_cache=True, assistant_model=assistant_model)
```
In the original `simonl0909/whisper-large-v2-cantonese` model, it runs at 0.714s/sample for a CER of 7.65. \
Using speculative decoding with `alvanlii/whisper-small-cantonese`, it runs at 0.137s/sample for a CER of 7.67, which is much faster.

## Training Hyperparameters
- learning_rate: 5e-5
- train_batch_size: 25 (on 1 3090 GPU)
- eval_batch_size: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 25x4=100
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 15000
- augmentation: None