arxiv:2602.04683

UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

Published on Feb 4 · Submitted by Dongchao Yang on Feb 6

AI-generated summary

Researchers developed ReasoningCodec, a discrete audio codec that separates audio into reasoning and reconstruction tokens for improved understanding and generation. Building on it, they trained UniAudio 2.0, a unified autoregressive model trained on large-scale text and audio data that performs strongly across a wide range of audio tasks and generalizes well in few-shot and zero-shot scenarios.

Abstract

We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
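
To make the factorized-tokenization idea concrete, here is a minimal numpy sketch of a two-stream codec interface: a coarse, low-rate "reasoning" stream and a dense "reconstruction" stream, each produced by nearest-neighbor vector quantization. Everything here (the class name, codebook sizes, the random-projection encoder, and the 4x downsampling of the reasoning stream) is an illustrative assumption, not the paper's actual architecture.

```python
import numpy as np

class FactorizedCodecSketch:
    """Toy two-stream tokenizer: a coarse 'reasoning' stream and a
    dense 'reconstruction' stream, each via nearest-neighbor VQ.
    Hypothetical interface; not the ReasoningCodec implementation."""

    def __init__(self, frame: int = 320, dim: int = 32,
                 n_reason: int = 64, n_recon: int = 1024, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.frame = frame
        # Random analysis/synthesis bases stand in for learned encoders.
        self.proj = rng.standard_normal((frame, dim)) / np.sqrt(frame)
        self.reason_book = rng.standard_normal((n_reason, dim))  # small codebook
        self.recon_book = rng.standard_normal((n_recon, dim))    # large codebook

    def _quantize(self, feats: np.ndarray, book: np.ndarray) -> np.ndarray:
        # Nearest codebook entry per frame (squared L2 distance).
        d = ((feats[:, None, :] - book[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def encode(self, wav: np.ndarray):
        n = len(wav) // self.frame
        frames = wav[: n * self.frame].reshape(n, self.frame)
        feats = frames @ self.proj
        # Sparse coarse tokens (every 4th frame) + dense fine tokens.
        reasoning = self._quantize(feats[::4], self.reason_book)
        reconstruction = self._quantize(feats, self.recon_book)
        return reasoning, reconstruction

    def decode(self, reconstruction: np.ndarray) -> np.ndarray:
        # Only the reconstruction stream drives waveform synthesis.
        feats = self.recon_book[reconstruction]
        return (feats @ self.proj.T).reshape(-1)

codec = FactorizedCodecSketch()
wav = np.random.default_rng(1).standard_normal(16000)  # 1 s at 16 kHz
reason_toks, recon_toks = codec.encode(wav)
print(len(reason_toks), len(recon_toks))  # coarse stream is ~4x sparser
wav_hat = codec.decode(recon_toks)
```

The split mirrors the division of labor described in the abstract: the reasoning tokens give a language model a compact, text-aligned view of the audio for understanding and planning, while only the reconstruction tokens need to reach the waveform decoder, which is why the sketch decodes from the reconstruction stream alone.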

Community

Paper submitter

Audio Foundation Models

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/uniaudio-2-0-a-unified-audio-language-model-with-text-aligned-factorized-audio-tokenization-2105-0ec8879c

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications
