	<!--Copyright 2023 The HuggingFace Team. All rights reserved.

	Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
	an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
	specific language governing permissions and limitations under the License.
	-->

	# AudioLDM

	## Overview

	AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.

	Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
	is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
	latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
	sound effects, human speech and music.

	This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found [here](https://github.com/haoheliu/AudioLDM).

	## Text-to-Audio

	The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm](https://huggingface.co/cvssp/audioldm) and generate text-conditional audio outputs:

	```python
	from diffusers import AudioLDMPipeline
	import torch
	import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` may not expose it

	repo_id = "cvssp/audioldm"
	pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
	pipe = pipe.to("cuda")

	prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
	audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

	# save the audio sample as a .wav file
	scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
	```

	### Tips

	Prompts:
	* Descriptive prompts work best: use adjectives to describe the sound (e.g., "high quality" or "clear") and make the prompt context-specific (e.g., "water stream in a forest" instead of "stream").
	* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
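
The prompt tips above can be applied mechanically. The following is a minimal sketch, assuming a hypothetical helper `build_prompt` (not part of `diffusers`) that adds a context and a few quality adjectives to a general subject:

```python
def build_prompt(subject, context=None, qualities=("high quality", "clear")):
    """Compose a descriptive, context-specific prompt.

    Hypothetical helper for illustration only; it is not a diffusers API.
    """
    parts = [subject.strip()]
    if context:
        # Make the prompt context-specific, e.g. "in a forest".
        parts.append(context.strip())
    prompt = " ".join(parts)
    if qualities:
        # Append adjectives that describe the desired sound quality.
        prompt += ", " + ", ".join(qualities)
    return prompt


print(build_prompt("water stream", context="in a forest"))
# water stream in a forest, high quality, clear
```

The resulting string can be passed as the `prompt` argument of the pipeline call shown in the Text-to-Audio example.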

	Inference:
	* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: more steps give higher-quality audio at the expense of slower inference.
	* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
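
The two inference arguments above can be varied independently. The sketch below assumes a pipeline object `pipe` that has already been loaded as in the Text-to-Audio example. The helper names `generate_clip` and `expected_num_samples` are hypothetical, and the 16 kHz rate is taken from the `.wav` export in that example:

```python
SAMPLE_RATE = 16000  # assumed output rate, matching the rate used in the .wav export


def expected_num_samples(audio_length_in_s, sample_rate=SAMPLE_RATE):
    """Approximate number of samples in a clip of the requested length."""
    return int(audio_length_in_s * sample_rate)


def generate_clip(pipe, prompt, num_inference_steps=10, audio_length_in_s=5.0):
    """Run an already-loaded pipeline and return the first generated waveform.

    Hypothetical wrapper; it only forwards the two arguments discussed above.
    """
    output = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        audio_length_in_s=audio_length_in_s,
    )
    return output.audios[0]


# Usage, with `pipe` and `prompt` defined as in the Text-to-Audio example:
# draft = generate_clip(pipe, prompt, num_inference_steps=10, audio_length_in_s=2.5)
# final = generate_clip(pipe, prompt, num_inference_steps=50, audio_length_in_s=10.0)

print(expected_num_samples(5.0))  # 80000
```

A low step count is convenient for quickly checking a prompt. The step count can then be raised for the final sample.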

	### How to load and use different schedulers

	The AudioLDM pipeline uses the [`DDIMScheduler`] by default, but `diffusers` provides many other schedulers
	that can be used with it, such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`],
	and [`EulerAncestralDiscreteScheduler`]. We recommend the [`DPMSolverMultistepScheduler`], as it is currently the fastest
	of these schedulers.

	To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`]
	method, or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the
	[`DPMSolverMultistepScheduler`], you can do the following:

	```python
	>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
	>>> import torch

	>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
	>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

	>>> # or
	>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm", subfolder="scheduler")
	>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", scheduler=dpm_scheduler, torch_dtype=torch.float16)
	```
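
Because the same `from_config` pattern applies to any of the schedulers listed above, it can be wrapped in a small helper. This is a sketch only: `swap_scheduler` is a hypothetical function (not part of `diffusers`), and it assumes the pipeline exposes a `scheduler` attribute with a `config`, as in the example above:

```python
def swap_scheduler(pipeline, scheduler_cls):
    """Replace the pipeline's scheduler with `scheduler_cls`.

    The new scheduler is built from the current scheduler's config, so
    compatible settings carry over. Hypothetical helper, not a diffusers API.
    """
    pipeline.scheduler = scheduler_cls.from_config(pipeline.scheduler.config)
    return pipeline


# Usage, with `pipeline` loaded as in the example above:
# from diffusers import EulerDiscreteScheduler
# pipeline = swap_scheduler(pipeline, EulerDiscreteScheduler)
```

Swapping the scheduler this way keeps the pipeline's model weights in place, which makes it easier to compare schedulers on the same prompt.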

	## AudioLDMPipeline
	[[autodoc]] AudioLDMPipeline
	- all
	- __call__