metadata
datasets:
  - homebrewltd/instruction-speech-whispervq-v2
language:
  - en
  - vi
license: cc-by-nc-sa-4.0
tags:
  - sound language model
  - audio-text-to-text
  - torchtune
  - whisperspeech


Ichigo Whisper

Ichigo Whisper is a compact (22M parameters), open-source speech tokenizer for the Whisper-medium model, designed to improve performance on multilingual speech with minimal impact on the model's original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, making it more compatible with large language models (LLMs) for immediate speech understanding.
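To illustrate what compressing speech into discrete tokens means, here is a minimal sketch of nearest-neighbour vector quantization in plain Python. The toy codebook, the 2-dimensional frame vectors, and the function name `quantize` are illustrative assumptions. The actual model uses a learned codebook over Whisper encoder features, and its exact quantization procedure is not described in this card.

```python
import math

def quantize(frames, codebook):
    """Map each continuous frame vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        # Euclidean distance from this frame to every codebook vector.
        distances = [math.dist(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Toy codebook with 4 entries of dimension 2 (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Three continuous "encoder frames" (hypothetical values).
frames = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = quantize(frames, codebook)
print(tokens)
```

Each frame becomes a single integer index, which is the kind of input an LLM vocabulary can absorb directly, whereas continuous embeddings would need a separate projection layer.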

This speech tokenizer was trained on ~400 hours of English data and ~1000 hours of Vietnamese data.

Ichigo Whisper is a key component of the Ichigo v0.5 family.

For more details, please refer to our official blog post.

Model Summary

Developed by: Homebrew Research.

Model Architecture: WhisperVQ

Model type: Quantizer of Whisper

Language(s): English and Vietnamese

License: CC-BY-NC-SA-4.0

Resources

Demo: Ichigo Whisper demo

Blog: Blog post

Intended Use

Intended Use Cases: This model is primarily intended for research applications. This version aims to further improve Whisper's performance on low-resource languages.

Out-of-scope: The use of Ichigo Whisper in any manner that violates applicable laws or regulations is strictly prohibited.

How to Get Started

For inference, please refer to the official Ichigo Whisper repository, which provides a demo script:

python demo/inference.py --input path/to/your/audio.wav

Training Specs

Hardware Specifications

| Component | Details |
|-----------|---------|
| GPUs | 8 × NVIDIA A6000 |

Training Time

| Phase | Duration |
|-------|----------|
| Phase 1 | 75 hours (50 epochs) |
| Phase 2 | 29 hours (20 epochs) |
| Total Training | 104 hours |
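As a quick check on the figures above, the following sketch recomputes the total and derives per-epoch and GPU-hour numbers. The GPU-hour figure assumes all 8 GPUs were in use for the full duration of both phases, which the card does not state explicitly.

```python
# (hours, epochs) per phase, taken from the training-time table.
phases = {"Phase 1": (75, 50), "Phase 2": (29, 20)}
num_gpus = 8  # from the hardware table

total_hours = sum(hours for hours, _ in phases.values())
# Assumption: every GPU was busy for the whole run.
gpu_hours = total_hours * num_gpus
hours_per_epoch = {name: hours / epochs for name, (hours, epochs) in phases.items()}

print(f"total: {total_hours} h, GPU-hours: {gpu_hours}")
for name, value in hours_per_epoch.items():
    print(f"{name}: {value:.2f} h/epoch")
```

The two phases run at nearly the same pace (about 1.5 hours per epoch).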

Phase 1: With KL Loss

| Parameter | Value |
|-----------|-------|
| Initialization Method | WhisperVQ-Large-v3 (7 languages) embeddings with duplication |
| Epochs | 50 |
| Global Batch Size | 336 |
| Learning Rate | 1e-3 |
| Learning Scheduler | Linear warm-up with Cosine decay |
| Optimizer | AdamW |
| Warmup Ratio | 500 |
| Weight Decay | 0.001 |
| Max Audio Length | 30 seconds (padded audio) |
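The scheduler named above, linear warm-up followed by cosine decay, can be sketched as follows. This is an illustration and not the training code: it reads the "Warmup Ratio 500" entry as 500 warm-up steps, decays to a minimum of 0, and uses a hypothetical `total_steps`, none of which is confirmed by the card. The function name `lr_at_step` is likewise made up for this example.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=500, min_lr=0.0):
    """Learning rate with linear warm-up followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warm-up period.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up schedule completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

total_steps = 10_000  # hypothetical; the real step count is not given in the card
for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{lr_at_step(step, total_steps):.6f}")
```

The rate reaches the peak of 1e-3 at the end of warm-up, then follows half a cosine wave down to the minimum at the final step.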

Phase 2: Without KL Loss

| Parameter | Value |
|-----------|-------|
| Initialization Method | Phase 1 checkpoint |
| Epochs | 20 |
| Global Batch Size | 336 |
| Learning Rate | 1e-3 |
| Learning Scheduler | Linear warm-up with Cosine decay |
| Optimizer | AdamW |
| Warmup Ratio | 500 |
| Weight Decay | 0.001 |
| Max Audio Length | 30 seconds (padded audio) |
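Both phases cap inputs at 30 seconds of padded audio. A minimal sketch of that preprocessing step is shown below, assuming Whisper's standard 16 kHz input rate and zero-padding at the end of the clip. The helper name `pad_or_trim` and the list-based representation are illustrative; the actual training pipeline may pad differently.

```python
SAMPLE_RATE = 16_000                     # Whisper's standard input sample rate
MAX_SECONDS = 30                         # max audio length from the table
TARGET_LEN = SAMPLE_RATE * MAX_SECONDS   # 480,000 samples

def pad_or_trim(samples, target_len=TARGET_LEN):
    """Return a copy of `samples` that is exactly `target_len` long."""
    if len(samples) >= target_len:
        # Clip anything beyond the maximum length.
        return list(samples[:target_len])
    # Append silence (zeros) to reach the fixed length.
    return list(samples) + [0.0] * (target_len - len(samples))

one_second_clip = [0.5] * SAMPLE_RATE    # hypothetical 1-second waveform
padded = pad_or_trim(one_second_clip)
print(len(padded))
```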

Evaluation

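  1. Vietnamese

| Model Name | Codebook Size | Test Dataset | Test Samples | WER |
|------------|---------------|--------------|--------------|-----|
| IchigoWhisper | 2561 | viVoice | 10000 | 11.68 |
| Whisper Medium | - | viVoice | 10000 | 18.30 |

  2. English

| Model Name | Codebook Size | Test Dataset | Test Samples | WER |
|------------|---------------|--------------|--------------|-----|
| IchigoWhisper | 2561 | LibriTTS-R | 4689 | 11.89 |
| Whisper Medium | - | LibriTTS-R | 4689 | 13.06 |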
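For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) is commonly computed, as word-level edit distance divided by the number of reference words, and derives the relative WER reduction implied by the results above. It is not the evaluation script used for these numbers. Text normalization is omitted, and the function names and sample sentences are illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the `baseline` error."""
    return (baseline - improved) / baseline

# One substitution and one deletion against a 6-word reference.
print(wer("the cat sat on the mat", "the cat sit on mat"))

# WER values taken from the evaluation results above.
print(f"Vietnamese: {relative_reduction(18.30, 11.68):.1%}")
print(f"English:    {relative_reduction(13.06, 11.89):.1%}")
```

On these figures, IchigoWhisper lowers WER relative to Whisper Medium by roughly 36% on viVoice and roughly 9% on LibriTTS-R.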

Citation Information

BibTeX:

@article{IchigoWhisper2024,
  title={IchigoWhisper},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Ichigo-whisper}
}

Acknowledgement