Model Card for SLAM
This is a Speech Language Model (SLM) trained to generate speech continuations over discrete HuBERT tokens.
Model Details
Model Description
This is a Speech Language Model, introduced in "Slamming: Training a Speech Language Model on One GPU in a Day", focusing on efficient training. It was fine-tuned from Qwen/Qwen2.5-0.5B over a vocabulary of 500 speech tokens extracted from the 11th layer of mhubert-25hz. For a stronger version of the model, trained with slightly more compute (2×A100 for 2 days), see slam_scaled.
The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light, and the synthetic dataset sTinyStories. It was then trained with DPO over SpokenSwag.
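For a sense of scale: mhubert-25hz emits 25 units per second, so one minute of audio yields roughly 1,500 speech tokens before de-duplication, and fewer afterwards.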
- Developed by: SLP-RL
- Model type: SpeechLM
- License: MIT
- Finetuned from model: Qwen/Qwen2.5-0.5B
Model Sources
- Repository: https://github.com/slp-rl/slamkit
- Paper: https://arxiv.org/abs/2502.15814
- Demo: https://pages.cs.huji.ac.il/adiyoss-lab/slamming/
Uses
This is a base SpeechLM and, as such, can be used to generate continuations for speech segments or as a base for further tuning. See the SlamKit codebase for more details on usage, and check out the demo page for some generation examples.
Out-of-Scope Use
This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, its outputs should not be treated as factual in any way.
How to Get Started with the Model
We refer users to the official repository for full usage instructions - github.
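The exact tokenization, generation, and vocoding pipeline is implemented in SlamKit; the following is only a minimal sketch, assuming the checkpoint loads as a standard 🤗transformers causal LM whose vocabulary is the 500 speech units. The prompt IDs and sampling parameters below are placeholders, not the official recipe.

```python
# A minimal sketch, NOT the official SlamKit API: it assumes the checkpoint
# loads as a standard causal LM over the 500 speech units. See the SlamKit
# repository for the supported tokenization and vocoding steps.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("slprl/slam")
model.eval()

# Hypothetical prompt: speech-unit IDs obtained by encoding audio with
# mhubert-25hz + k-means (see Preprocessing below), then de-duplicating.
prompt_units = torch.tensor([[17, 402, 88, 251, 3]])  # placeholder IDs

with torch.no_grad():
    continuation = model.generate(
        prompt_units,
        max_new_tokens=250,   # ~10 seconds at 25 units/sec (before dedup)
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
    )
# The generated unit IDs must be converted back to a waveform with a
# unit-based vocoder; SlamKit provides the full pipeline.
```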
Training Details
We highly encourage users to read the full paper for complete training details; a brief overview is provided below.
Training Data
For the pre-training phase, this model was trained on a subset of the LibriSpeech train split, Libri-Light, and the synthetic dataset sTinyStories. It was then trained with DPO on the synthetic dataset SpokenSwag.
Training Procedure
This model was trained by next-token prediction over several datasets and then trained with DPO over SpokenSwag. Please refer to the paper or code for the full training recipes.
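For intuition only: the per-pair DPO objective pushes the policy to assign a higher reward margin to the chosen continuation than to the rejected one, relative to a frozen reference model. The sketch below illustrates that loss in isolation; it is not the training code used in the paper, and beta is a hypothetical default.

```python
# Schematic DPO loss for one batch of preference pairs; a sketch of the
# objective, not the paper's training code. logp_* are summed token
# log-probabilities of the chosen/rejected continuations under the current
# policy and a frozen reference model; beta is the DPO temperature.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```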
Preprocessing
Speech tokens are extracted from the audio using mhubert-25hz and quantized using the official k-means released with the model in textlesslib. Units are de-duplicated. We encourage you to explore the official repository for full details - github.
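The de-duplication step merges runs of identical consecutive units; a minimal sketch of that step (the official implementation lives in the SlamKit repository) is:

```python
from itertools import groupby

def deduplicate(units):
    """Collapse consecutive repeated speech units into a single unit,
    e.g. [5, 5, 5, 17, 17, 5] -> [5, 17, 5]."""
    return [u for u, _ in groupby(units)]
```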
Evaluation
The paper provides full results; we give some results below and also refer readers to the demo page to listen to some samples.
Model | Compute (GPU days) | Parameters | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
---|---|---|---|---|---|---|---|
TWIST-1.3B | 160×V100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
TWIST-7B | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
TWIST-13B | ? | 13B | 59.20 | 55.4 | 76.4 | - | - |
Scaled Optimal | ? | 823M | 61.3 | 56.7 | 78.0 | - | - |
Predicted Optimal | 1×A5000 | 78M | 56.85 | 54.09 | 70.49 | - | - |
TWIST-350M (Original recipe) | 1×A5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
Slam (-DPO) (ours) | 1×A5000 | 358M | 56.45 ± .17 | 55.59 ± .30 | 78.01 ± .27 | 88.3 ± 1.0 | 3.47 ± .17 |
Slam (ours) | 1×A5000 | 358M | 58.86 ± .20 | 58.04 ± .51 | 82.04 ± .21 | 62.8 ± 4.1 | 3.88 ± .11 |
Compute Infrastructure
This model was trained as part of "Slamming: Training a Speech Language Model on One GPU in a Day", which focuses on efficient training.
Hardware
This model was trained using only a single NVIDIA A5000 GPU, 16 CPU cores, and 24 GB of RAM for 24 hours.
Software
The model was trained using the SlamKit codebase, which builds upon 🤗transformers, extending it to support easy and efficient training of Speech Language Models.
Citation
BibTeX:
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day},
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814},
}