arXiv:2406.15487

Improving Text-To-Audio Models with Synthetic Captions

Published on Jun 18, 2024

Abstract

Obtaining high-quality training data, especially captions, for text-to-audio models remains an open challenge. Although prior methods have leveraged text-only language models to augment and improve captions, such methods are limited in scale and in the coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new state of the art.
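The abstract describes the captioning pipeline only at a high level. Below is a minimal Python sketch of how such a caption-synthesis loop could be structured: sample several candidate captions per clip from an audio language model, score each caption against the audio, and keep only well-grounded ones. The function names (`generate_candidate_captions`, `audio_text_similarity`), the similarity-threshold filtering step, and the file paths are illustrative assumptions, not details taken from the paper; the two model calls are placeholder stubs to keep the sketch self-contained.

```python
"""Minimal sketch of a synthetic audio-captioning pipeline (assumed design,
not the paper's exact method): caption each clip several times with an
audio language model, then filter captions by audio-text similarity."""
import random
from typing import List, Tuple


def generate_candidate_captions(audio_path: str, n: int = 5) -> List[str]:
    # Placeholder stub: swap in a real audio language model that samples
    # n diverse captions for the clip at audio_path.
    return [f"placeholder caption {i} for {audio_path}" for i in range(n)]


def audio_text_similarity(audio_path: str, caption: str) -> float:
    # Placeholder stub: swap in a real audio-text embedding model and
    # return the cosine similarity between audio and caption embeddings.
    return random.random()


def caption_dataset(audio_paths: List[str],
                    threshold: float = 0.45,
                    per_clip: int = 5) -> List[Tuple[str, str, float]]:
    """Return (audio_path, caption, score) triples that pass the filter."""
    kept: List[Tuple[str, str, float]] = []
    for path in audio_paths:
        candidates = generate_candidate_captions(path, n=per_clip)
        # Keep only captions sufficiently grounded in the audio. This
        # threshold filter is an assumed mechanism for selecting
        # "accurate" captions; the abstract does not specify one.
        for caption in candidates:
            score = audio_text_similarity(path, caption)
            if score >= threshold:
                kept.append((path, caption, score))
    return kept


if __name__ == "__main__":
    clips = ["audioset/clip_0001.wav", "audioset/clip_0002.wav"]  # hypothetical paths
    for path, caption, score in caption_dataset(clips):
        print(f"{score:.2f}  {path}  {caption}")
```

In this sketch, raising the threshold trades dataset size for caption accuracy, while sampling more candidates per clip recovers coverage; the resulting (audio, caption) pairs would then serve as pre-training data for a text-to-audio model.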

Community

Paper author

Sharing our latest work on text-to-audio models.

🚀 We propose a data labeling pipeline to generate large-scale, high-quality synthetic captions for audio.
🚀 We introduce AF-AudioSet: a large, diverse, and high-quality synthetic caption dataset produced with our pipeline.
🚀 We obtain state-of-the-art models on text-to-audio and text-to-music through pre-training on AF-AudioSet and conduct a systematic study across various settings.

🔥 We will soon release TangoMusic, which achieved strong results on the MusicBench benchmark.

