---
language:
- ms
- en
- zh
- ta
datasets:
- mesolitica/Malaysian-STT-Whisper
- malaysia-ai/STT-Whisper
base_model:
- openai/whisper-large-v3-turbo
---

# Malaysian Finetune Whisper Large V3 Turbo

Whisper Large V3 Turbo finetuned on Malaysian context.

## Improvement

1. Distilled from Whisper Large V3 on Malaysian and science contexts.
2. Better translation for Malay, Manglish, Mandarin, Tamil and science contexts.
3. Word-level timestamps through the newly introduced `<|transcribeprecise|>` token, **a new task!**
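
To illustrate where a task token sits, here is a minimal sketch of the decoder prefix layout that stock Whisper uses (`<|startoftranscript|>`, then a language token, then a task token). It assumes `<|transcribeprecise|>` occupies the same task slot as `<|transcribe|>` and `<|translate|>`; this card does not confirm that, and `build_decoder_prefix` is a hypothetical helper, not part of any library.

```python
# Assumption: the new token goes in the task slot of the usual Whisper prefix.
# "transcribe" and "translate" are the stock Whisper tasks; "transcribeprecise"
# is the task token this model card introduces.
KNOWN_TASKS = {"transcribe", "translate", "transcribeprecise"}

# Language codes listed in this model card's metadata.
CARD_LANGUAGES = {"ms", "en", "zh", "ta"}


def build_decoder_prefix(language: str, task: str) -> str:
    """Return the decoder prefix string for a language code and a task name."""
    if language not in CARD_LANGUAGES:
        raise ValueError(f"language not listed in this card: {language!r}")
    if task not in KNOWN_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"<|startoftranscript|><|{language}|><|{task}|>"


prefix = build_decoder_prefix("ms", "transcribeprecise")
print(prefix)
```

How the prefix is passed to the model depends on the inference library in use, so that step is left out here.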

## How did we finetune it?

We finetuned in two phases:

1. Finetune on [mesolitica/Malaysian-STT-Whisper](https://huggingface.co/datasets/mesolitica/Malaysian-STT-Whisper)
   - WandB at https://wandb.ai/huseinzol05/malaysian-whisper-large-v3-turbo-v3?nw=nwuserhuseinzol05, **still training**
2. Annealing on 5% of [mesolitica/Malaysian-STT-Whisper](https://huggingface.co/datasets/mesolitica/Malaysian-STT-Whisper) and 100% of [malaysia-ai/STT-Whisper](https://huggingface.co/datasets/malaysia-ai/STT-Whisper), **still training**
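
The annealing mix in phase 2 can be sketched roughly as below. This is an illustration only, not the actual training pipeline: `build_annealing_mix` is a hypothetical helper, the rows are placeholder strings rather than real audio samples, and uniform random sampling of the 5% subset is an assumption.

```python
import random


def build_annealing_mix(phase1_rows, phase2_rows, phase1_fraction=0.05, seed=0):
    """Combine a random fraction of the phase-1 rows with all phase-2 rows."""
    rng = random.Random(seed)
    # Keep 5% of the phase-1 data (assumed here to be sampled uniformly).
    n_keep = round(len(phase1_rows) * phase1_fraction)
    subset = rng.sample(list(phase1_rows), n_keep)
    # Use 100% of the phase-2 data, then shuffle the combined set.
    mix = subset + list(phase2_rows)
    rng.shuffle(mix)
    return mix


# Placeholder rows standing in for the two datasets.
phase1 = [f"phase1-{i}" for i in range(1000)]
phase2 = [f"phase2-{i}" for i in range(200)]

mix = build_annealing_mix(phase1, phase2)
print(len(mix))  # 50 phase-1 rows + 200 phase-2 rows = 250
```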