update readme

README.md CHANGED

````diff
@@ -123,9 +123,9 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
   --mixed_precision no
 ```
 ## Latent Audio Diffusion
-Rather than denoising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train denoising diffusion models and run inference with them. Secondly, as the latent space is really a array (tensor) of guassian variables with a particular mean,
+Rather than denoising images directly, it is interesting to work in the "latent space" after first encoding the images with an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of much lower dimension, so it is much faster to train denoising diffusion models and to run inference with them. Secondly, since the latent space is really an array (tensor) of Gaussian variables with a particular mean, the decoder is invariant to Gaussian noise. And thirdly, similar images tend to be clustered together, so interpolating between two images in latent space can produce meaningful combinations.
 
-At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality, rather like its cousin `transformers` in the early days of development. In order to train a VAE (Variational Autoencoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format.
+At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality, rather like its cousin `transformers` in the early days of its development. To train a VAE (Variational Autoencoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
 #### Train an autoencoder.
 ```bash
````
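The first paragraph added in this commit claims that interpolating between two images in latent space can produce meaningful combinations. A minimal sketch of how this could be tried with a VAE that has been converted to `diffusers` format follows; the model path and mel spectrogram filenames are placeholders, not files from this repo.

```python
# Sketch: interpolate two mel-spectrogram images in VAE latent space.
# Assumes a VAE in diffusers format; paths and filenames are hypothetical.
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

vae = AutoencoderKL.from_pretrained("path/to/converted-vae")  # placeholder

def to_tensor(path):
    # Load an image and scale pixel values from [0, 255] to [-1, 1].
    image = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0

with torch.no_grad():
    latent_a = vae.encode(to_tensor("mel_a.png")).latent_dist.mean
    latent_b = vae.encode(to_tensor("mel_b.png")).latent_dist.mean
    # Halfway linear interpolation between the two latents.
    decoded = vae.decode(0.5 * latent_a + 0.5 * latent_b).sample

image = ((decoded[0].permute(1, 2, 0).clamp(-1, 1) + 1.0) * 127.5).numpy().astype(np.uint8)
Image.fromarray(image).save("mel_mix.png")
```

Because the latents are the means of Gaussian posteriors, nearby points in latent space decode to plausible spectrograms, which is what makes the result a meaningful combination rather than a pixel-wise blend.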
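The second added paragraph floats the idea of a perceptual *audio* loss. One common candidate, offered here as a suggestion rather than anything implemented in this repo, is a multi-resolution STFT loss of the kind used in Parallel WaveGAN (Yamamoto et al.), which compares two waveforms in the time-frequency domain at several resolutions:

```python
# Sketch of a perceptual audio loss: multi-resolution STFT loss.
# Illustrative suggestion only, not code from this repository.
import torch
import torch.nn.functional as F

def stft_loss(x, y, n_fft, hop_length, win_length):
    # Compare log-magnitude spectrograms of two waveforms at one resolution.
    window = torch.hann_window(win_length, device=x.device)
    X = torch.stft(x, n_fft, hop_length, win_length, window=window, return_complex=True).abs()
    Y = torch.stft(y, n_fft, hop_length, win_length, window=window, return_complex=True).abs()
    spectral_convergence = torch.norm(Y - X) / torch.norm(Y)
    log_magnitude = F.l1_loss(torch.log(X + 1e-7), torch.log(Y + 1e-7))
    return spectral_convergence + log_magnitude

def multi_resolution_stft_loss(x, y):
    # Average over several resolutions so that both fine temporal detail
    # and broader spectral structure contribute to the loss.
    resolutions = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]
    return sum(stft_loss(x, y, *r) for r in resolutions) / len(resolutions)
```

Since the autoencoder here works on mel spectrogram images, using such a loss would mean converting reconstructions back to waveforms (e.g. via Griffin-Lim) inside the training loop, which is a non-trivial change.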