---
license: cc-by-nc-sa-4.0
---

# MusicLDM

MusicLDM is a latent text-to-audio diffusion model capable of generating music samples from a text input.
It is available in the 🧨 Diffusers library from v0.21.0 onwards.

# Model Details

MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, and Shlomo Dubnov.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview), MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents.
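
The pipeline handles this CLAP conditioning internally, but for intuition the CLAP text encoder can be queried directly through 🤗 Transformers. The sketch below is illustrative only: the checkpoint name `laion/clap-htsat-unfused` is an assumption and is not necessarily the CLAP variant bundled with MusicLDM.

```python
import torch
from transformers import ClapModel, ClapProcessor

# Illustrative CLAP checkpoint; MusicLDM ships its own CLAP weights inside the pipeline.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Embed a text prompt into CLAP's joint text-audio space.
inputs = processor(text=["Techno music with a strong, upbeat tempo"], return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)

print(text_embeds.shape)  # e.g. torch.Size([1, 512])
```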

MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. These beat-synchronous mixup strategies encourage the model to interpolate between training samples while staying within the domain of the training data, so the generated music is more diverse yet remains faithful to the corresponding style.
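
As a rough illustration of the time-domain variant, beat-synchronous mixup can be thought of as aligning two training clips at a downbeat and then interpolating between them. The sketch below is a minimal, hypothetical rendering of that idea: the function name, the Beta-distributed mixing ratio, and the circular shift are assumptions for illustration, not the authors' implementation (which also applies an analogous mixup in the latent space).

```python
import numpy as np

def beat_synchronous_mixup(x1, x2, downbeat1, downbeat2, alpha=0.4):
    """Mix two waveforms after aligning them at a shared downbeat.

    x1, x2      : 1-D arrays of audio samples at the same sampling rate
    downbeat1/2 : sample index of a downbeat in each clip (e.g. from a beat tracker)
    alpha       : Beta-distribution parameter controlling the mixing ratio
    """
    # Circularly shift x2 so its downbeat lines up with the downbeat of x1
    # (a simplification; a real pipeline would crop beat-aligned segments instead).
    x2_aligned = np.roll(x2, downbeat1 - downbeat2)

    # Trim both clips to a common length.
    n = min(len(x1), len(x2_aligned))
    x1, x2_aligned = x1[:n], x2_aligned[:n]

    # Sample a mixup ratio and interpolate between the two beat-aligned clips.
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2_aligned
```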

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).

## Model Sources

- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/musicldm)
- [**Paper**](https://huggingface.co/papers/2308.01546)
- [**Demo**](https://musicldm.github.io)
- [**Try It!!**](https://huggingface.co/spaces/ircam-reach/musicldm-text-to-music)

# Usage

First, install the required packages (scipy is needed to save the generated audio in the examples below):

```
pip install --upgrade diffusers transformers accelerate scipy
```
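
If the libraries are already installed, a quick check of the diffusers version confirms that the `MusicLDMPipeline` is available (it ships with v0.21.0 and later, as noted above):

```python
import diffusers

# MusicLDM requires diffusers v0.21.0 or newer.
print(diffusers.__version__)
```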

## Text-to-Music

For text-to-music generation, the [MusicLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/musicldm) can be used to load pre-trained weights and generate text-conditional audio outputs:

```python
import torch
from diffusers import MusicLDMPipeline

repo_id = "ucsd-reach/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
```

The resulting audio output can be saved as a .wav file:

```python
import scipy

# MusicLDM generates audio with a sampling rate of 16 kHz
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(audio, rate=16000)
```

## Tips

When constructing a prompt, keep in mind:

* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".

During inference:

* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring is then performed between the generated waveforms and the prompt text, and the audio outputs are returned ranked from best to worst.
* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.

The following example demonstrates how to construct a good audio generation using the aforementioned tips:

```python
import scipy
import torch
from diffusers import MusicLDMPipeline

# load the pipeline
repo_id = "ucsd-reach/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "low quality, average quality"

# set the seed for reproducibility
generator = torch.Generator("cuda").manual_seed(0)

# run the generation, passing the generator so results are reproducible
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
```

# Citation

**BibTeX:**
```
@article{chen2023musicldm,
  title={MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies},
  author={Chen*, Ke and Wu*, Yusong and Liu*, Haohe and Nezhurina, Marianna and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  journal={arXiv preprint arXiv:2308.01546},
  year={2023}
}
```