---
license: cc-by-nc-sa-4.0
---
# MusicLDM
MusicLDM is a latent text-to-audio diffusion model capable of generating music samples from a text input.
It is available in the 🧨 Diffusers library from v0.21.0 onwards.
# Model Details
MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents.
MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to
the music samples, both in the time domain and in the latent space. These strategies encourage the model to interpolate
between the training samples while staying within the domain of the training data. The result is generated music that
is more diverse while remaining faithful to the corresponding style.
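As a rough illustration of the time-domain case, the sketch below blends two waveforms with a mixup weight. It assumes the two clips have already been tempo-matched and beat-aligned, which is the part that makes the augmentation beat-synchronous and is not shown here. The function name `beat_synchronous_mixup`, the sine-wave stand-ins for music clips, and the Beta(5, 5) sampling of the weight are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def beat_synchronous_mixup(wave_a, wave_b, lam):
    """Blend two equal-length waveforms with weight `lam` in [0, 1].

    Hypothetical helper: both inputs are assumed to be beat-aligned already.
    """
    wave_a = np.asarray(wave_a, dtype=np.float32)
    wave_b = np.asarray(wave_b, dtype=np.float32)
    if wave_a.shape != wave_b.shape:
        raise ValueError("waveforms must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Convex combination: lam = 1 returns wave_a, lam = 0 returns wave_b.
    return lam * wave_a + (1.0 - lam) * wave_b


# Two one-second sine tones stand in for aligned music clips (assumption).
sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
clip_a = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)
clip_b = np.sin(2 * np.pi * 330.0 * t).astype(np.float32)

# Sample the mixing weight as in standard mixup; the Beta parameters are illustrative.
rng = np.random.default_rng(0)
lam = float(rng.beta(5.0, 5.0))

mixed = beat_synchronous_mixup(clip_a, clip_b, lam)
```

The same convex combination could in principle be applied to latent vectors instead of raw waveforms, which is how the latent-space variant described above can be pictured.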
This work is licensed under a
[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).
## Model Sources
- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/musicldm)
- [**Paper**](https://huggingface.co/papers/2308.01546)
- [**Demo**](https://musicldm.github.io)
- [**Try It!!**](https://huggingface.co/spaces/ircam-reach/musicldm-text-to-music)
# Usage
First, install the required packages:
```
pip install --upgrade diffusers transformers accelerate scipy
```
## Text-to-Music
For text-to-music generation, the [MusicLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/musicldm) can be
used to load pre-trained weights and generate text-conditional audio outputs:
```python
from diffusers import MusicLDMPipeline
import torch
repo_id = "ucsd-reach/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
```
The resulting audio output can be saved as a .wav file:
```python
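import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` may not expose it

# MusicLDM generates audio at a 16 kHz sampling rate
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)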
```
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio
Audio(audio, rate=16000)
```
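When a floating-point array is passed to `scipy.io.wavfile.write`, the file is stored as floating-point WAV, which some players and tools may not read. The sketch below converts a waveform to 16-bit PCM first. It assumes the audio is a float array nominally in the range [-1, 1]; the helper name `float_to_pcm16` and the sine-tone stand-in for generated audio are hypothetical.

```python
import numpy as np


def float_to_pcm16(audio):
    """Convert a float waveform (assumed to lie in [-1, 1]) to 16-bit PCM."""
    audio = np.asarray(audio, dtype=np.float32)
    # Clip first so out-of-range samples saturate instead of wrapping around.
    clipped = np.clip(audio, -1.0, 1.0)
    return np.round(clipped * 32767.0).astype(np.int16)


# A half-amplitude 440 Hz tone stands in for a generated waveform (assumption).
sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

pcm = float_to_pcm16(tone)
```

The returned array can be passed as `data=` to `scipy.io.wavfile.write` in place of the float waveform.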
## Tips
When constructing a prompt, keep in mind:
* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
During inference:
* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; more steps give higher-quality audio at the expense of slower inference.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. The pipeline then automatically scores each generated waveform against the prompt text and returns the audios ranked from best to worst.
* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.
The following example applies these tips to generate a good-quality audio sample:
```python
import scipy.io.wavfile
import torch
from diffusers import MusicLDMPipeline

# load the pipeline
repo_id = "ucsd-reach/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "low quality, average quality"

# set the seed
generator = torch.Generator("cuda").manual_seed(0)

# run the generation, passing the seeded generator so the result is reproducible
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
```
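The best-to-worst ordering of the returned waveforms can be pictured as a similarity ranking between one text embedding and one embedding per waveform. The sketch below is a stand-in under stated assumptions, not the pipeline's own code: the toy three-dimensional vectors replace real text and audio embeddings, cosine similarity is assumed as the score, and `rank_by_similarity` is a hypothetical helper.

```python
import numpy as np


def rank_by_similarity(text_embedding, audio_embeddings):
    """Return candidate indices sorted by cosine similarity to the text, best first."""
    text = np.asarray(text_embedding, dtype=np.float64)
    audio = np.asarray(audio_embeddings, dtype=np.float64)
    # Normalise so that the dot product equals cosine similarity.
    text = text / np.linalg.norm(text)
    audio = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    scores = audio @ text
    order = np.argsort(-scores)  # descending similarity
    return order, scores[order]


# Toy embeddings (assumption): one prompt and three candidate waveforms.
text_emb = np.array([1.0, 0.0, 0.0])
audio_embs = np.array(
    [
        [0.0, 1.0, 0.0],  # unrelated to the prompt
        [0.9, 0.1, 0.0],  # closest to the prompt
        [0.5, 0.5, 0.0],  # in between
    ]
)

order, scores = rank_by_similarity(text_emb, audio_embs)
```

With a ranking like this, index 0 of the reordered candidates is the one that best matches the prompt, which is why the example above saves `audio[0]`.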
# Citation
**BibTeX:**
```
@article{chen2023musicldm,
title={MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies},
author={Chen*, Ke and Wu*, Yusong and Liu*, Haohe and Nezhurina, Marianna and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
journal={arXiv preprint arXiv:2308.01546},
year={2023}
}
```