Whisper large-v3-singlish

Whisper large-v3-singlish is a fine-tuned automatic speech recognition (ASR) model optimized for Singlish. Built on OpenAI's Whisper large-v3, it was adapted with Singlish-specific data to capture the unique phonetic and lexical nuances of Singlish speech.

Model Details

  • Developed by: Ming Jie Wong
  • Base Model: openai/whisper-large-v3
  • Model Type: Encoder-decoder
  • Metrics: Word Error Rate (WER)
  • Languages Supported: English (with a focus on Singlish)
  • License: Apache-2.0

Description

Whisper large-v3-singlish was developed using an internal dataset of 66.9k audio-transcript pairs, derived exclusively from the Part 3 Same Room Environment Close-talk Mic recordings of IMDA's National Speech Corpus (NSC).

The original Part 3 of the National Speech Corpus comprises approximately 1,000 hours of conversational speech from around 1,000 local English speakers, recorded in pairs. These conversations cover everyday topics and include interactive game-based dialogues. Recordings were conducted in two environments:

  • Same Room, where speakers shared a room and were recorded using a close-talk mic and a boundary mic.
  • Separate Room, where each speaker was recorded individually using a standing mic and a telephone (IVR).

Audio segments for the internal dataset were extracted using the following criteria (a preprocessing sketch follows the list):

  • Minimum Word Count: 10 words

    This threshold was chosen to ensure that each audio segment carries enough linguistic context for the model to learn Singlish phrasing. Shorter segments risk biasing the model towards specific utterances or phrases, limiting its overall comprehension.

  • Maximum Duration: 20 seconds

    This threshold balances providing enough context for accurate transcription against the added noise and computational cost of longer audio segments.

  • Sampling Rate: All audio segments are down-sampled to 16 kHz.
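
The exact preprocessing pipeline is internal, but a minimal sketch of how such filtering might look with the Hugging Face datasets library is shown below. The data directory and the audio/transcript column names are assumptions, not the actual internal setup.

from datasets import Audio, load_dataset

# Hypothetical raw corpus folder; the real internal dataset differs.
ds = load_dataset("audiofolder", data_dir="nsc_part3_close_talk", split="train")

# Down-sample every clip to 16 kHz on the fly.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def keep(example):
    # Enforce the minimum word count (10) and maximum duration (20 s).
    words = len(example["transcription"].split())  # assumed column name
    duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
    return words >= 10 and duration <= 20.0

ds = ds.filter(keep)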

Full experiment details will be added soon.

Fine-Tuning Details

We fine-tuned the model on a single A100-80GB GPU.

Training Hyperparameters

The following hyperparameters were used (see the sketch after the list for how they map onto training arguments):

  • batch_size: 8
  • gradient_accumulation_steps: 2
  • learning_rate: 5e-8
  • warmup_steps: 500
  • max_steps: 5000
  • fp16: true
  • eval_batch_size: 8
  • eval_step: 300
  • max_grad_norm: 1.0
  • generation_max_length: 225
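
For reference, here is a sketch of how these settings map onto Hugging Face Seq2SeqTrainingArguments. The output directory is a placeholder, and any argument not listed above is left at its default; this is an illustration, not the exact training script.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-singlish",  # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=5e-8,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    per_device_eval_batch_size=8,
    eval_strategy="steps",
    eval_steps=300,
    max_grad_norm=1.0,
    predict_with_generate=True,   # needed so evaluation can decode and score WER
    generation_max_length=225,
)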

Training Results

The table below summarizes the model's progress across training steps, showing the training loss, evaluation loss, and Word Error Rate (WER).

| Steps | Train Loss | Eval Loss | WER       |
|-------|------------|-----------|-----------|
| 300   | 1.6879     | 1.4495    | 70.680466 |
| 600   | 1.3011     | 1.0669    | 48.520662 |
| 900   | 0.8413     | 0.6757    | 19.961466 |
| 1200  | 0.6635     | 0.5910    | 15.904360 |
| 1500  | 0.6056     | 0.5285    | 15.622370 |
| 1800  | 0.5485     | 0.4633    | 14.692986 |
| 2100  | 0.4744     | 0.4175    | 14.560111 |
| 2400  | 0.4890     | 0.3894    | 14.193229 |
| 2700  | 0.4407     | 0.3784    | 14.191015 |
| 3000  | 0.4675     | 0.3708    | 14.348988 |
| 3300  | 0.4260     | 0.3661    | 14.264834 |
| 3600  | 0.4174     | 0.3627    | 14.389589 |

Although training was capped at 5,000 steps, early stopping with a patience of 3 was employed via EarlyStoppingCallback, and the final checkpoint corresponds to the step with the lowest WER. This strategy was informed by prior experience fine-tuning similar Whisper models such as whisper-large-v3-turbo and whisper-small; a sketch of the setup follows.
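
In Trainer terms, that setup might look like the sketch below. The model, dataset, and processor variables are assumed to come from the usual Whisper fine-tuning recipe; checkpoint selection by WER additionally requires load_best_model_at_end=True, metric_for_best_model="wer", and greater_is_better=False in the training arguments.

import evaluate
from transformers import EarlyStoppingCallback, Seq2SeqTrainer

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    # Replace label padding, decode both sides, and score with WER.
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}

trainer = Seq2SeqTrainer(
    model=model,                  # assumed: a WhisperForConditionalGeneration
    args=training_args,
    train_dataset=train_ds,       # assumed: prepared train/eval splits
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)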

Benchmark Performance

We evaluated Whisper large-v3-singlish on SASRBench-v1, a benchmark dataset for assessing ASR performance on Singlish.
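
As an illustration, a WER evaluation over the benchmark could be run along the following lines. The name of the reference-transcript column is an assumption; check the dataset card for the actual schema.

import evaluate
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", "mjwong/whisper-large-v3-singlish")
wer_metric = evaluate.load("wer")

ds = load_dataset("mjwong/SASRBench-v1", split="test")
# Transcribe every clip, then compare against the references.
predictions = [pipe(audio)["text"] for audio in ds["audio"]]
references = ds["text"]  # assumed column name
print(f"WER: {100 * wer_metric.compute(predictions=predictions, references=references):.2f}")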

Disclaimer

While this model has been fine-tuned to better recognize Singlish, users may experience inaccuracies, biases, or unexpected outputs, particularly in challenging audio conditions or with speakers using non-standard variations. Use of this model is at your own risk; the developers and distributors are not liable for any consequences arising from its use. Please validate results before deploying in any sensitive or production environment.

How to use the model

The model can be loaded with the automatic-speech-recognition pipeline like so:

from transformers import pipeline

# Load the fine-tuned checkpoint into an ASR pipeline.
model = "mjwong/whisper-large-v3-singlish"
pipe = pipeline("automatic-speech-recognition", model)

You can then use this pipeline to transcribe audio of arbitrary length; a chunked-inference variant for long recordings is sketched after the example below.

from datasets import load_dataset

# Grab one sample from the SASRBench-v1 test split.
dataset = load_dataset("mjwong/SASRBench-v1", split="test")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
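
For recordings longer than Whisper's 30-second input window, chunked inference can be enabled when constructing the pipeline. The chunk length and batch size below are illustrative defaults, not tuned values.

# Chunked long-form transcription with timestamps.
pipe = pipeline(
    "automatic-speech-recognition",
    "mjwong/whisper-large-v3-singlish",
    chunk_length_s=30,
    batch_size=8,
)
result = pipe("path/to/long_audio.wav", return_timestamps=True)
print(result["text"])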

Contact

For more information, please reach out to [email protected].

Acknowledgements

  1. https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english
  2. https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/README.md
  3. https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809