---
language:
- zh
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
base_model: openai/whisper-small
datasets:
- mozilla-foundation/common_voice_11_0
model-index:
- name: Whisper Small zh-HK - Alvin
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: mozilla-foundation/common_voice_16_0 yue
type: mozilla-foundation/common_voice_16_0
config: yue
split: test
args: yue
metrics:
- name: Normalized CER
type: cer
value: 7.93
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
# Whisper Small zh-HK - Alvin
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for Cantonese. It achieves a CER of 7.93 (without punctuation) and 9.72 (with punctuation) on the Common Voice 16.0 yue test set.
## Training and evaluation data
The following datasets were used for training. Citations for the two external corpora:
- CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
- Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf
|Name|# of Hours|
|--|--|
|Common Voice 16.0 zh-HK Train|138|
|Common Voice 16.0 yue Train|85|
|Common Voice 17.0 yue Train|178|
|Cantonese-ASR|72|
|CantoMap|23|
|[Pseudo-Labelled YouTube Data](https://huggingface.co/datasets/alvanlii/cantonese-youtube-pseudo-transcription)|438|
For evaluation, the Common Voice 16.0 yue test set is used.
## Results
- CER (lower is better): 0.0972
  - down from 0.1073 and 0.1581 in the previous versions
- CER (punctuation removed): 0.0793
- GPU inference with Flash Attention (example below): 0.055s/sample
  - Note: all GPU evaluations were run on an RTX 3090
- GPU inference: 0.308s/sample
- CPU inference: 2.57s/sample
- GPU VRAM: ~1.5 GB
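To make the two CER figures concrete, here is a minimal sketch of how a character error rate can be computed with and without punctuation. The exact normalization used for the numbers above is not stated in this card; stripping every Unicode punctuation character is an assumption, and the helper names (`strip_punctuation`, `edit_distance`, `cer`) are illustrative only.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P' (assumed normalization)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, remove_punct: bool = False) -> float:
    """Edit distance divided by the reference length, optionally ignoring punctuation."""
    if remove_punct:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好，世界。"
hypothesis = "你好世界"
print(cer(reference, hypothesis))                     # punctuation counts as errors
print(cer(reference, hypothesis, remove_punct=True))  # punctuation ignored
```

In this toy pair the only differences are the two punctuation marks, so the score drops to zero once punctuation is removed. The same effect, at a smaller scale, accounts for the gap between the two CER values reported above.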
## Using the Model
```
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio and resample it to 16 kHz, the rate Whisper expects
y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/whisper-small-cantonese"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Clear forced/suppressed tokens so the fine-tuned model decodes freely
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform to log-mel input features
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)

# Decode the generated token IDs back to text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)
```
Alternatively, you can use the Hugging Face `pipeline` API:
```
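import torch
from transformers import pipeline

MODEL_NAME = "alvanlii/whisper-small-cantonese"
lang = "zh"
# Placeholders: choose a device and point `file` at your own audio
device = "cuda:0" if torch.cuda.is_available() else "cpu"
file = "audio.mp3"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
# Force transcription in the target language instead of auto-detection
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]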
```
## Model Speedup
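Pass `attn_implementation="sdpa"` when loading the model. This selects PyTorch's scaled dot-product attention, which can dispatch to a Flash Attention kernel on supported GPUs.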
```
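import torch
from transformers import AutoModelForSpeechSeq2Seq

# Assumed choice: half precision on GPU, full precision on CPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(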
"alvanlii/whisper-small-cantonese",
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
attn_implementation="sdpa",
)
```
Using Flash Attention reduced the amount of time taken per sample from 0.308s to 0.055s.
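If you want to check the per-sample timing on your own hardware, a small helper like the one below can be used. This is only a sketch: the measurement procedure behind the figures in this card is not documented, `seconds_per_sample` is a hypothetical helper, and `fake_transcribe` is a stand-in so the example runs without downloading a model.

```python
import time
from typing import Callable, Sequence


def seconds_per_sample(transcribe: Callable[[str], str],
                       files: Sequence[str],
                       warmup: int = 1) -> float:
    """Average wall-clock seconds per call of `transcribe` over `files`."""
    if not files:
        raise ValueError("files must not be empty")
    # Untimed warm-up calls keep one-off costs (model load, CUDA init) out of the mean
    for path in files[:warmup]:
        transcribe(path)
    start = time.perf_counter()
    for path in files:
        transcribe(path)
    return (time.perf_counter() - start) / len(files)


def fake_transcribe(path: str) -> str:
    """Stand-in for a real transcription call; sleeps briefly instead of running a model."""
    time.sleep(0.01)
    return "text for " + path


avg = seconds_per_sample(fake_transcribe, ["a.mp3", "b.mp3", "c.mp3"])
print(f"{avg:.3f}s/sample")
```

To time the real model, replace `fake_transcribe` with a function that runs the pipeline or `model.generate` on one file and returns the decoded text.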
## Speculative Decoding
You can run a larger model as the main model and pass `alvanlii/whisper-small-cantonese` as the assistant (draft) model to speed up inference with almost no loss in accuracy.
```
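import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Assumed setup: use the GPU with half precision when available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "simonl0909/whisper-large-v2-cantonese"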
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
attn_implementation="sdpa",
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
assistant_model_id = "alvanlii/whisper-small-cantonese"
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
assistant_model_id,
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
attn_implementation="sdpa",
)
assistant_model.to(device)
...
model.generate(**inputs, use_cache=True, assistant_model=assistant_model)
```
On its own, the `simonl0909/whisper-large-v2-cantonese` model runs at 0.714s/sample with a CER of 7.65. \
With speculative decoding using `alvanlii/whisper-small-cantonese` as the assistant, it runs at 0.137s/sample with a CER of 7.67, which is roughly 5x faster.
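The near-identical CER follows from how speculative decoding works: the small model only proposes tokens, and the large model decides which ones to keep. Below is a toy sketch of the greedy variant with integer "tokens" and hand-written stand-in models (`main_model`, `draft_model` are hypothetical, not real Whisper models). It is not the `transformers` implementation, which verifies all proposed tokens in a single batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(model(seq))
    return seq


def speculative_decode(main: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the main model verifies."""
    seq = list(prompt)
    target_len = len(prompt) + max_new
    while len(seq) < target_len:
        # 1. The cheap draft model proposes up to k tokens
        n = min(k, target_len - len(seq))
        proposal: List[int] = []
        for _ in range(n):
            proposal.append(draft(seq + proposal))
        # 2. The main model checks each proposed position in order
        for i, tok in enumerate(proposal):
            expected = main(seq + proposal[:i])
            if expected != tok:
                # First disagreement: keep the agreed prefix plus the main model's token
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every proposed token was accepted
            seq.extend(proposal)
    return seq


def main_model(seq: List[int]) -> int:
    """Toy 'large' model: a fixed deterministic rule on the last token."""
    return (seq[-1] * 3 + 1) % 7


def draft_model(seq: List[int]) -> int:
    """Toy 'small' model: agrees with the main model except at every fifth position."""
    tok = main_model(seq)
    return (tok + 1) % 7 if len(seq) % 5 == 0 else tok


baseline = greedy_decode(main_model, [1], 20)
assisted = speculative_decode(main_model, draft_model, [1], 20)
print(assisted == baseline)
```

In this greedy sketch the assisted output always equals what the main model would produce alone, so the speedup depends only on how often the draft model guesses correctly. A draft model fine-tuned on the same language, as here, should agree with the main model more often than a generic one.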
## Training Hyperparameters
- learning_rate: 5e-5
- train_batch_size: 25 (on a single RTX 3090 GPU)
- eval_batch_size: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 25x4=100
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 15000
- augmentation: None
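As a worked example of what these settings imply, the sketch below computes the effective batch size and the learning rate at a few steps. It assumes the usual linear-warmup-then-linear-decay shape for `lr_scheduler_type: linear` and that `training_steps` counts optimizer updates; `linear_schedule_lr` is an illustrative helper, not code from the training run.

```python
def linear_schedule_lr(step: int,
                       base_lr: float = 5e-5,
                       warmup_steps: int = 500,
                       total_steps: int = 15000) -> float:
    """Learning rate at `step`: linear ramp up during warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Effective batch size: per-device batch times gradient accumulation steps
effective_batch = 25 * 4
# Samples processed over the run, assuming one optimizer update per training step
samples_seen = effective_batch * 15000

print("effective batch size:", effective_batch)
print("samples seen:", samples_seen)
for step in (0, 250, 500, 7750, 15000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at 5e-5 when warmup ends at step 500, falls to half of that by step 7750, and reaches zero at step 15000.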