---
license: cc-by-nc-sa-4.0
datasets:
- AudioCaps+others
language:
- en
tags:
- audio
---
# Auffusion
**Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It generates realistic audio from text prompts, including human sounds, animal sounds, natural and artificial sounds, and sound effects. Auffusion adapts text-to-image (T2I) model frameworks to the TTA task, leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.
📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and is intended for easy audio manipulation.
📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.
📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.
## Auffusion Model Family
| Model Name | Model Path |
|----------------------------|------------------------------------------------------------------------------------------------------------------------ |
| Auffusion | [https://huggingface.co/auffusion/auffusion](https://huggingface.co/auffusion/auffusion) |
| Auffusion-Full | [https://huggingface.co/auffusion/auffusion-full](https://huggingface.co/auffusion/auffusion-full) |
| Auffusion-Full-no-adapter | [https://huggingface.co/auffusion/auffusion-full-no-adapter](https://huggingface.co/auffusion/auffusion-full-no-adapter)|
## Code
Our code is released here: [https://github.com/happylittlecat2333/Auffusion](https://github.com/happylittlecat2333/Auffusion)
Several **Auffusion**-generated samples are available here: [https://auffusion.github.io](https://auffusion.github.io)
Please follow the instructions in the repository for installation, usage and experiments.
## Quickstart Guide
First, clone the repository and install the requirements:
```bash
git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt
```
Download the **Auffusion** model and generate audio from a text prompt:
```python
import IPython.display
import soundfile as sf
from auffusion_pipeline import AuffusionPipeline

# Load the pre-trained pipeline (downloaded on first use).
pipeline = AuffusionPipeline.from_pretrained("auffusion/auffusion")

# Generate audio from a text prompt and take the first result.
prompt = "Birds singing sweetly in a blooming garden"
output = pipeline(prompt=prompt)
audio = output.audios[0]

# Save the 16 kHz waveform, then play it back in a notebook.
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
```
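The snippet above uses the raw prompt as the output file name. That works for simple prompts, but a prompt containing characters such as `/` or `:` may not be a valid path on every system. A minimal sketch of one way to derive a safer name follows; `prompt_to_filename` is a hypothetical helper written for this example and is not part of the Auffusion repository.

```python
import re

def prompt_to_filename(prompt: str, max_length: int = 64, extension: str = ".wav") -> str:
    """Turn a free-text prompt into a conservative file name (hypothetical helper)."""
    # Lowercase, then replace every run of characters outside [a-z0-9] with one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    # Truncate long prompts and fall back to a fixed name if nothing is left.
    slug = slug[:max_length].rstrip("_") or "audio"
    return slug + extension

print(prompt_to_filename("Birds singing sweetly in a blooming garden"))
# birds_singing_sweetly_in_a_blooming_garden.wav
```

The result can be passed to `sf.write` in place of `f"{prompt}.wav"`.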
The Auffusion model is downloaded automatically from Hugging Face and cached locally. Subsequent runs load the model directly from the cache.
By default, the pipeline samples from the latent diffusion model with 100 inference steps and a `guidance_scale` of 7.5. You can vary these parameters to get different results.
```python
prompt = "Rolling thunder with lightning strikes"
output = pipeline(prompt=prompt, num_inference_steps=100, guidance_scale=7.5)
audio = output.audios[0]
IPython.display.Audio(data=audio, rate=16000)
```
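To compare several settings for the same prompt, the two parameters can be swept in a loop. The sketch below is illustrative only: `parameter_grid` and `generate_variants` are hypothetical helpers, the step counts and scales are arbitrary example values, and the only pipeline behaviour assumed is what the Quickstart already shows (keyword arguments `prompt`, `num_inference_steps`, `guidance_scale`, and an output with an `audios` list).

```python
from itertools import product

def parameter_grid(steps_options, scale_options):
    """Return one keyword-argument dict per (steps, guidance scale) combination."""
    return [
        {"num_inference_steps": steps, "guidance_scale": scale}
        for steps, scale in product(steps_options, scale_options)
    ]

def generate_variants(pipeline, prompt, grid):
    """Call the pipeline once per setting and collect (settings, audio) pairs."""
    results = []
    for settings in grid:
        # Assumes the call signature used in the Quickstart examples.
        output = pipeline(prompt=prompt, **settings)
        results.append((settings, output.audios[0]))
    return results

# Example values only; they are not recommendations from the authors.
grid = parameter_grid(steps_options=[50, 100], scale_options=[5.0, 7.5])
print(len(grid))  # 4 combinations
```

With a pipeline loaded as in the Quickstart, `generate_variants(pipeline, "Rolling thunder with lightning strikes", grid)` returns one audio array per setting, which can then be written to disk or played back for comparison.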