datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
- vi
license: cc-by-nc-sa-4.0
tags:
- sound language model
- audio-text-to-text
- torchtune
- whisperspeech
Ichigo Whisper
Ichigo Whisper is a compact (22M parameters), open-source speech tokenizer for the Whisper-medium model, designed to improve performance on multilingual speech with minimal impact on the model's original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, which makes it more compatible with large language models (LLMs) for immediate speech understanding.
This speech tokenizer was trained on approximately 400 hours of English data and approximately 1,000 hours of Vietnamese data.
Ichigo Whisper is a key component of the Ichigo v0.5 family.
For more details, please refer to our official blog post.
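The difference between continuous embeddings and discrete tokens can be illustrated with a generic vector-quantization lookup. The sketch below is not the WhisperVQ implementation: it assumes a plain nearest-neighbour search under Euclidean distance, a random codebook, and a hypothetical embedding dimension of 64. Only the codebook size (2561) is taken from the evaluation tables in this card.

```python
import numpy as np


def quantize(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each continuous frame embedding to the index of its nearest code."""
    # Pairwise squared Euclidean distances between frames (T, D) and codes (K, D).
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    distances = np.sum(diffs ** 2, axis=-1)  # shape (T, K)
    # One integer token per frame.
    return np.argmin(distances, axis=1)


rng = np.random.default_rng(0)
# 2561 codes as reported in this card; the dimension 64 is a hypothetical choice.
codebook = rng.normal(size=(2561, 64))
# Three "frames" built by slightly perturbing known codebook entries.
frames = codebook[[5, 42, 2560]] + 0.01 * rng.normal(size=(3, 64))
tokens = quantize(frames, codebook)
print(tokens)
```

Each frame is reduced to a single integer index, which is the kind of symbol an LLM vocabulary can absorb directly.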
Model Summary
Developed by: Homebrew Research.
Model Architecture: WhisperVQ
Model type: Quantizer of Whisper
Language(s): English and Vietnamese
License: CC-BY-NC-SA-4.0
Resources
Demo: Ichigo Whisper demo
Blog: Blog post
Intended Use
Intended Use Cases: This model is primarily intended for research applications. This version aims to further improve Whisper's performance on low-resource languages.
Out-of-scope: The use of Ichigo Whisper in any manner that violates applicable laws or regulations is strictly prohibited.
How to Get Started
For inference, please refer to the official Ichigo Whisper repository.
python demo/inference.py --input path/to/your/audio.wav
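To run the command above over several files, a small wrapper can build one invocation per audio file. This is a sketch under assumptions: it relies only on the `--input` flag shown above, expects to be run from the repository root, and the helper names `build_command` and `run_on_directory` are hypothetical, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(audio_path: Path) -> list[str]:
    """Mirror the documented call: python demo/inference.py --input <file>."""
    return [sys.executable, "demo/inference.py", "--input", str(audio_path)]


def run_on_directory(audio_dir: str, dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) one inference command per .wav file."""
    commands = [build_command(p) for p in sorted(Path(audio_dir).glob("*.wav"))]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the script exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` (the default) the function only returns the command lists, which makes it easy to inspect them before launching anything.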
Training Specs
Hardware Specifications
Component | Details |
---|---|
GPUs | 8 × NVIDIA A6000 |
Training Time
Phase | Duration |
---|---|
Phase 1 | 75 hours (50 epochs) |
Phase 2 | 29 hours (20 epochs) |
Total Training | 104 hours |
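
The totals in the table can be checked with a few lines of arithmetic. The sketch below uses only the durations and epoch counts listed above plus the GPU count from the hardware table. The GPU-hours figure assumes all 8 GPUs were in use for the full duration, which the card does not state explicitly.

```python
# (hours, epochs) per phase, copied from the Training Time table.
phases = {"Phase 1": (75, 50), "Phase 2": (29, 20)}
NUM_GPUS = 8  # from the Hardware Specifications table

total_hours = sum(hours for hours, _ in phases.values())
hours_per_epoch = {name: hours / epochs for name, (hours, epochs) in phases.items()}
# Assumption: every GPU was busy for the whole run.
gpu_hours = total_hours * NUM_GPUS

print(total_hours, hours_per_epoch, gpu_hours)
```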
Phase 1: With KL Loss
Parameter | Value |
---|---|
Initialization Method | WhisperVQ-Large-v3 (7 languages) embeddings with duplication |
Epochs | 50 |
Global Batch Size | 336 |
Learning Rate | 1e-3 |
Learning Scheduler | Linear warm-up with Cosine decay |
Optimizer | AdamW |
Warmup Ratio | 500 |
Weight Decay | 0.001 |
Max Audio Length | 30 seconds (padded audio) |
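
The "Linear warm-up with Cosine decay" schedule in the table can be sketched as a function of the training step. This is an illustration, not the training code: it assumes the value 500 in the table counts warm-up steps, that the rate decays to zero at the end of training, and it uses a hypothetical total of 10,000 steps. Only the peak learning rate of 1e-3 comes from the table.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-3, warmup_steps: int = 500) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warm-up period.
        return peak_lr * step / warmup_steps
    # Fraction of the decay period completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical; the card does not report a step count
for step in (0, 250, 500, 5_250, 10_000):
    print(step, lr_at_step(step, TOTAL_STEPS))
```

The rate reaches its peak exactly when warm-up ends and is half the peak at the midpoint of the decay period.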
Phase 2: Without KL Loss
Parameter | Value |
---|---|
Initialization Method | Phase 1 checkpoint |
Epochs | 20 |
Global Batch Size | 336 |
Learning Rate | 1e-3 |
Learning Scheduler | Linear warm-up with Cosine decay |
Optimizer | AdamW |
Warmup Ratio | 500 |
Weight Decay | 0.001 |
Max Audio Length | 30 seconds (padded audio) |
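
The "30 seconds (padded audio)" setting can be illustrated with a small padding helper. The sketch assumes 16 kHz mono input (Whisper's input sample rate) and zero-padding on the right. Trimming longer clips is an added assumption for completeness, and `pad_or_trim` here is a local illustrative function, not the one used in training.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper's input sample rate, assumed to apply here
MAX_SECONDS = 30      # from the Max Audio Length row
TARGET_LENGTH = SAMPLE_RATE * MAX_SECONDS


def pad_or_trim(audio: np.ndarray, length: int = TARGET_LENGTH) -> np.ndarray:
    """Return a 1-D waveform of exactly `length` samples."""
    if audio.shape[0] >= length:
        # Assumption: clips longer than the window are cut, not rejected.
        return audio[:length]
    # Zero-pad on the right up to the target length.
    return np.pad(audio, (0, length - audio.shape[0]))


clip = np.ones(5 * SAMPLE_RATE, dtype=np.float32)  # a 5-second dummy clip
padded = pad_or_trim(clip)
print(padded.shape)
```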
Evaluation
- Vietnamese
Model Name | Codebook Size | Test Dataset | Test Samples | WER |
---|---|---|---|---|
IchigoWhisper | 2561 | viVoice | 10000 | 11.68 |
Whisper Medium | - | viVoice | 10000 | 18.30 |
- English
Model Name | Codebook Size | Test Dataset | Test Samples | WER |
---|---|---|---|---|
IchigoWhisper | 2561 | LibriTTS-R | 4689 | 11.89 |
Whisper Medium | - | LibriTTS-R | 4689 | 13.06 |
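
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below is a generic implementation with whitespace tokenization. The text normalization used for the numbers in the tables is not described in this card, so this will not necessarily reproduce them. The relative-reduction helper simply restates the table values in relative terms.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, model_wer: float) -> float:
    """Fraction by which model_wer is lower than baseline_wer."""
    return (baseline_wer - model_wer) / baseline_wer


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# WER values copied from the tables above (Whisper Medium vs. IchigoWhisper).
vi_reduction = relative_reduction(18.30, 11.68)
en_reduction = relative_reduction(13.06, 11.89)
print(example_wer, vi_reduction, en_reduction)
```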
Citation Information
BibTeX:
@article{IchigoWhisper2024,
  title={IchigoWhisper},
  author={Homebrew Research},
  year=2024,
  month=December,
  url={https://huggingface.co/homebrewltd/Ichigo-whisper}
}
Acknowledgement
LibriTTS