Soundwave: Less is More for Speech-Text Alignment in LLMs

πŸˆβ€β¬› Github ο½œ πŸ“ƒ Paper| πŸ“Ό Online Demo 

Model Description

Soundwave is a speech-to-text model that bridges the gap between speech and text. It is trained on only 10k hours of data and performs strongly on speech translation and AIR-Bench speech tasks.

Key Features

  • A Speech-to-Text Model Bridging the Gap Between Speech and Text
  • Utilizes a Data-Efficient Strategy and a Unique Architecture, Trained on Only 10k Hours of Data
  • Exceptional Performance in Speech Translation and AIR-Bench Speech Tasks
  • Retains Intelligence During Conversations, Ideal for Interactive Tasks

Usage

Load the Soundwave model and run inference with your audio files as shown in the GitHub repository.
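
Before running inference, it can help to confirm that your audio files are in a format the model is likely to accept. The sketch below is not taken from the Soundwave repository: it uses only Python's standard `wave` module to read a WAV header, and the 16 kHz mono target (`ASSUMED_SAMPLE_RATE`, `ASSUMED_CHANNELS`) is an assumption for illustration. The names `WavInfo` and `inspect_wav` are hypothetical helpers; the GitHub repository remains the authority on the actual input requirements and loading code.

```python
import wave
from dataclasses import dataclass

# Assumption (not stated in this model card): the speech encoder expects
# 16 kHz mono input. Check the GitHub repository for the real requirement.
ASSUMED_SAMPLE_RATE = 16_000
ASSUMED_CHANNELS = 1


@dataclass
class WavInfo:
    """Basic properties read from a WAV header (hypothetical helper type)."""
    sample_rate: int
    channels: int
    duration_s: float

    @property
    def matches_assumed_format(self) -> bool:
        # True only if the file matches the assumed 16 kHz mono target.
        return (self.sample_rate == ASSUMED_SAMPLE_RATE
                and self.channels == ASSUMED_CHANNELS)


def inspect_wav(path: str) -> WavInfo:
    """Read the header of a PCM WAV file and report its basic properties."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return WavInfo(
            sample_rate=rate,
            channels=wav.getnchannels(),
            duration_s=wav.getnframes() / rate,
        )
```

A file for which `inspect_wav(path).matches_assumed_format` is false would need resampling or downmixing first, if the 16 kHz mono assumption holds for your setup.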

📖 Citation

@article{zhang2025soundwave,
  title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
  author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
  journal={arXiv preprint arXiv:2502.12900},
  year={2025}
}

Model size: 9.4B params (Safetensors), tensor type FP16