
Slamming: Training a Speech Language Model on One GPU in a Day

The model was presented in the paper Slamming: Training a Speech Language Model on One GPU in a Day.

Paper abstract

We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .

Model Card for slam_scaled

This is a Speech Language Model (SLM) trained to generate speech continuations over discrete HuBERT tokens.

Model Details

Model Description

This Speech Language Model, introduced in "Slamming: Training a Speech Language Model on One GPU in a Day", focuses on efficient training. It was fine-tuned from Qwen/Qwen2.5-0.5B over a vocabulary of 500 speech tokens extracted from the 11th layer of mhubert-25hz.

The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and the synthetic dataset sTinyStories, and subsequently fine-tuned with DPO on SpokenSwag.
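To illustrate the idea (this is a minimal sketch, not the SlamKit implementation), re-purposing a text LM such as Qwen2.5-0.5B as a unit LM amounts to keeping the transformer body and swapping its vocabulary for the small speech-unit vocabulary. The number of special tokens below is an assumption for the example.

```python
from transformers import AutoModelForCausalLM

N_SPEECH_UNITS = 500   # k-means clusters over mHuBERT layer-11 features
N_SPECIAL = 3          # hypothetical count of special tokens (e.g. BOS/EOS/PAD)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
# Replace the text vocabulary with the (much smaller) speech-unit vocabulary;
# the new embedding rows are randomly initialised and learned during pre-training.
model.resize_token_embeddings(N_SPEECH_UNITS + N_SPECIAL)
print(model.get_input_embeddings().weight.shape)  # (503, hidden_size)
```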


Uses

This base SpeechLM can be used to generate continuations for speech segments, or as a base for further tuning. See the SlamKit codebase for more details on usage, and check out the demo page for some generation examples.

Out-of-Scope Use

This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, its outputs should not be treated as factual in any way.

How to Get Started with the Model

We refer users to the official GitHub repository for full usage instructions.
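As a rough sketch only, assuming the checkpoint loads as a standard causal LM via 🤗 Transformers and that its inputs and outputs are sequences of speech-unit IDs, generation could look roughly as follows. The supported path is the SlamKit interface, which also handles unit extraction from audio and vocoding back to speech; the unit values below are dummies.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("slprl/slam_scaled")
model.eval()

# A prompt of speech units extracted from an utterance (dummy values for illustration).
prompt_units = torch.tensor([[17, 42, 42, 311, 7, 499, 23]])

with torch.no_grad():
    continuation = model.generate(
        prompt_units, max_new_tokens=200, do_sample=True, top_p=0.95
    )

# `continuation` is a longer sequence of unit IDs; a unit-to-speech vocoder
# (as in the SlamKit pipeline) is needed to turn it back into audio.
print(continuation[0].tolist())
```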

Training Details

We highly encourage users to read the full paper for complete training details; a brief overview is provided below.

Training Data

This model was trained on a subset of LibriSpeech train, Libri-Light and the synthetic dataset sTinyStories for the pre-training phase. It was also trained with DPO on the synthetic dataset SpokenSwag.

Training Procedure

This model was trained by next-token prediction over several datasets, and then trained with DPO over SpokenSwag. Please refer to the paper or code for the full training recipes.
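For readers unfamiliar with DPO, the following is a minimal PyTorch sketch of the preference objective over speech-unit sequences; it is illustrative only and is not the SlamKit training code. Each batch is assumed to hold a prompt with a preferred ("chosen") and a dispreferred ("rejected") continuation, as in SpokenSwag, with prompt positions masked out via `-100` labels.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    # Sum of per-token log-probabilities of the continuation (positions with label -100 are masked).
    logits = model(input_ids).logits[:, :-1]
    logps = torch.log_softmax(logits, dim=-1)
    targets = labels[:, 1:]
    mask = targets != -100
    token_logps = torch.gather(logps, 2, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

def dpo_loss(policy, ref, chosen_ids, chosen_labels, rejected_ids, rejected_labels, beta=0.1):
    pi_chosen = sequence_logprob(policy, chosen_ids, chosen_labels)
    pi_rejected = sequence_logprob(policy, rejected_ids, rejected_labels)
    with torch.no_grad():  # the reference model (the pre-trained SLM) stays frozen
        ref_chosen = sequence_logprob(ref, chosen_ids, chosen_labels)
        ref_rejected = sequence_logprob(ref, rejected_ids, rejected_labels)
    # DPO: increase the policy's chosen-vs-rejected log-ratio relative to the reference's.
    margins = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margins).mean()
```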

Preprocessing

Speech tokens are extracted from the audio using mhubert-25hz and quantised using the official k-means model released with it in textlesslib. Consecutive duplicate units are removed. We encourage you to explore the official GitHub repository for full details.
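As an illustrative approximation only (the official pipeline uses textlesslib with the released k-means model), the extraction and de-duplication steps look roughly like this. The HuBERT checkpoint id, whether it loads with the stock `HubertModel` class, and the centroid file path are all assumptions for the sketch; 16 kHz mono audio is assumed.

```python
import numpy as np
import torch
import torchaudio
from transformers import HubertModel

# Assumption: HF id of mhubert-25hz; loading details may differ for the 25 Hz variant.
hubert = HubertModel.from_pretrained("slprl/mhubert-base-25hz")
# Hypothetical path to the 500 k-means centroids over layer-11 features, shape (500, dim).
centroids = torch.from_numpy(np.load("kmeans_500_centroids.npy")).float()

wav, sr = torchaudio.load("utterance.wav")          # mono waveform, shape (1, samples)
wav = torchaudio.functional.resample(wav, sr, 16_000)

with torch.no_grad():
    feats = hubert(wav, output_hidden_states=True).hidden_states[11]  # layer-11 features

# Assign each frame to its nearest centroid -> discrete units, then drop consecutive repeats.
units = torch.cdist(feats[0], centroids).argmin(-1)
keep = torch.cat([torch.tensor([True]), units[1:] != units[:-1]])
deduped = units[keep]
print(deduped.tolist())
```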

Evaluation

The paper provides full results; a selection is reproduced below, and the demo page includes samples to listen to.

| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|---|---|---|---|---|---|---|---|---|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | 61.3 | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | 58.7 | 81.8 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | 62.4 | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | 62.3 | 61.1 | 86.8 | — | — |
| **Ours (Slam)** | | | | | | | | |
| Slam (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| Slam | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| Slam (scaled) | 2×A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75 |

Compute Infrastructure

This model was trained as part of "Slamming: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.

Hardware

This model was trained using only 2 Nvidia A100 GPUs for 48 hours.

Software

The model was trained using the SlamKit codebase, which builds upon 🤗 Transformers and extends it to support easy and efficient training of Speech Language Models.

Citation

BibTeX:

@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day}, 
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814}, 
}