license: cc-by-nc-sa-4.0
language:
  - en
tags:
  - audio

Auffusion is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. Auffusion adapts text-to-image (T2I) model frameworks to the TTA task, leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

📣 We are releasing Auffusion-Full-no-adapter, which was pre-trained on all datasets described in the paper and is intended to make audio manipulation easier.

📣 We are releasing Auffusion-Full, which was pre-trained on all datasets described in the paper.

📣 We are releasing Auffusion, which was pre-trained on AudioCaps.

Auffusion Model Family

Code

Our code is released here: https://github.com/happylittlecat2333/Auffusion

We have uploaded several Auffusion-generated samples here: https://auffusion.github.io

Please follow the instructions in the repository for installation, usage and experiments.

Quickstart Guide

First, clone the repository and install the requirements:

git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt

Download the Auffusion model and generate audio from a text prompt:

import IPython, torch
import soundfile as sf
from auffusion_pipeline import AuffusionPipeline

# Load the pre-trained pipeline (downloaded on first use).
pipeline = AuffusionPipeline.from_pretrained("auffusion/auffusion")

# Generate audio from a text prompt.
prompt = "Birds singing sweetly in a blooming garden"
output = pipeline(prompt=prompt)
audio = output.audios[0]

# Save the clip as a 16 kHz WAV file and play it in a notebook.
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)

The Auffusion model is downloaded automatically from Hugging Face and saved in the local cache. Subsequent runs load the model directly from the cache.
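
The quickstart saves the clip as f"{prompt}.wav", which works for simple prompts but can fail when a prompt contains characters such as "/" or "?" that are not valid in file names on some systems. The sketch below shows one possible way to derive a safe file name from a prompt. The helper `prompt_to_filename` and its `max_length` parameter are hypothetical and not part of the Auffusion repository.

```python
import re


def prompt_to_filename(prompt: str, extension: str = ".wav", max_length: int = 80) -> str:
    """Turn a free-form prompt into a conservative file name (hypothetical helper)."""
    # Collapse every run of characters other than ASCII letters, digits,
    # "_" or "-" into a single underscore, then trim underscores at the ends.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", prompt).strip("_")
    # Cap the length and fall back to a generic name if nothing is left.
    stem = stem[:max_length].rstrip("_") or "audio"
    return stem + extension
```

With this helper, the save step in the quickstart could be written as sf.write(prompt_to_filename(prompt), audio, samplerate=16000).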

By default, the pipeline samples from the latent diffusion model with 100 inference steps (num_inference_steps) and a guidance_scale of 7.5. You can vary these parameters to get different results.

prompt = "Rolling thunder with lightning strikes"
output = pipeline(prompt=prompt, num_inference_steps=100, guidance_scale=7.5)
audio = output.audios[0]
IPython.display.Audio(data=audio, rate=16000)
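
To compare settings side by side, the same prompt can be run over a small grid of values. The following is a minimal sketch, assuming only what the examples above show: a callable that accepts `prompt`, `num_inference_steps` and `guidance_scale` and returns an object with an `audios` list. The helper `sweep_settings` is hypothetical, and the alternative values 50 and 5.0 are arbitrary illustrations rather than recommended settings.

```python
from itertools import product


def sweep_settings(generate, prompt, steps_options=(50, 100), guidance_options=(5.0, 7.5)):
    """Run one prompt over every (steps, guidance) pair (hypothetical helper).

    `generate` is any callable with the keyword interface used in the
    examples above, for instance the loaded Auffusion pipeline.
    """
    results = {}
    for steps, guidance in product(steps_options, guidance_options):
        output = generate(
            prompt=prompt,
            num_inference_steps=steps,
            guidance_scale=guidance,
        )
        # Keep the first generated clip for this combination of settings.
        results[(steps, guidance)] = output.audios[0]
    return results
```

With the pipeline loaded as in the quickstart, sweep_settings(pipeline, "Rolling thunder with lightning strikes") would return a dictionary of four clips keyed by (steps, guidance), which can then be played or saved one by one.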

Citation

Please consider citing the following article if you find our work useful:

@article{xue2024auffusion,
  title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation}, 
  author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
  journal={arXiv preprint arXiv:2401.01044},
  year={2024}
}