# audio-diffusion

### Apply [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package to synthesize music instead of images.

---

![mel spectrogram](mel.png)

Audio can be represented as images by transforming to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), such as the one shown above. The class `Mel` in `mel.py` converts a slice of audio into a mel spectrogram of `x_res` x `y_res` pixels and vice versa. The higher the resolution, the less audio information is lost. You can see how this works in the `test-mel.ipynb` notebook.
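The lossy step in this round trip is quantizing the log-amplitude spectrogram to an 8-bit grayscale image. Below is an illustrative numpy sketch of that principle (it is not the actual code in `mel.py`; the `top_db` dynamic-range value and function names are assumptions for the example):

```python
import numpy as np

def spectrogram_to_image(S_db, top_db=80):
    """Quantize a log-mel spectrogram (in dB, values in [-top_db, 0]) to 8-bit grayscale."""
    S_clipped = np.clip(S_db, -top_db, 0)  # keep the top `top_db` dB of dynamic range
    return ((S_clipped + top_db) / top_db * 255).astype(np.uint8)

def image_to_spectrogram(img, top_db=80):
    """Invert the quantization back to dB values (lossy: 8-bit steps)."""
    return img.astype(np.float32) / 255 * top_db - top_db

# Round trip on a fake 64x64 log-mel spectrogram: the reconstruction error
# is bounded by one quantization step (top_db / 255 dB).
S_db = -80 * np.random.default_rng(0).random((64, 64))
S_rec = image_to_spectrogram(spectrogram_to_image(S_db))
print(np.max(np.abs(S_rec - S_db)) < 80 / 255 + 1e-6)  # True
```

Increasing `x_res` and `y_res` reduces how coarsely time and frequency are sampled, which is why higher resolutions preserve more of the audio.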

A DDPM model is trained on a set of mel spectrograms that have been generated from a directory of audio files. It is then used to synthesize similar mel spectrograms, which are then converted back into audio. See the `test-model.ipynb` notebook for an example.
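At synthesis time, DDPM starts from pure Gaussian noise and iteratively denoises it into a plausible spectrogram. In this repo that loop is handled by the `diffusers` package; the sketch below shows only the reverse-process update from the DDPM paper with a placeholder in place of the trained U-Net (the schedule length `T = 50` and the dummy model are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule, as in the DDPM paper (Ho et al., 2020)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def dummy_eps_model(x, t):
    """Stand-in for the trained U-Net: returns the predicted noise in x at step t."""
    return np.zeros_like(x)

# Reverse process: start from noise and denoise step by step
x = rng.standard_normal((64, 64))  # one 64x64 "spectrogram" sample
for t in reversed(range(T)):
    eps = dummy_eps_model(x, t)
    coef = betas[t] / np.sqrt(1.0 - alphas_cumprod[t])
    x = (x - coef * eps) / np.sqrt(alphas[t])
    if t > 0:  # noise is added at every step except the last
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(x.shape)  # (64, 64)
```

The resulting array is then mapped back to audio via `Mel`, which is the inverse of the dataset-generation step above.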

## Generate Mel spectrogram dataset from directory of audio files
#### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial-grade GPU (e.g. RTX 2080 Ti). Set `hop_length` to 1024 for better results.

```bash
python src/audio_to_images.py \
  --resolution 64 \
  --hop_length 1024 \
  --input_dir path-to-audio-files \
  --output_dir data-test
```

#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).

```bash
python src/audio_to_images.py \
  --resolution 256 \
  --input_dir path-to-audio-files \
  --output_dir data-256 \
  --push_to_hub teticio/audio-diffusion-256
```

## Train model
#### Run training on local machine.

```bash
accelerate launch --config_file accelerate_local.yaml \
  src/train_unconditional.py \
  --dataset_name data-64 \
  --resolution 64 \
  --hop_length 1024 \
  --output_dir ddpm-ema-audio-64 \
  --train_batch_size 16 \
  --num_epochs 100 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no
```

#### Run training on local machine with a `train_batch_size` of 1 and `gradient_accumulation_steps` of 16 to compensate, so that the 256x256 resolution model fits on a commercial-grade GPU.

```bash
accelerate launch --config_file accelerate_local.yaml \
  src/train_unconditional.py \
  --dataset_name teticio/audio-diffusion-256 \
  --resolution 256 \
  --output_dir ddpm-ema-audio-256 \
  --num_epochs 100 \
  --train_batch_size 1 \
  --eval_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no
```
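The combination of `--train_batch_size 1` and `--gradient_accumulation_steps 16` keeps the effective batch size at 16: gradients from 16 micro-batches are averaged before each optimizer step. A toy numpy check of that equivalence (the linear model and zero targets are assumptions purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))  # a "batch" of 16 examples
w = rng.standard_normal(8)

def grad(batch, w):
    """Mean-squared-error gradient for a toy linear model with zero targets."""
    return 2 * batch.T @ (batch @ w) / len(batch)

# One gradient on the full batch of 16...
g_full = grad(X, w)

# ...equals the average of 16 accumulated single-example gradients
g_accum = np.zeros_like(w)
for i in range(16):
    g_accum += grad(X[i:i+1], w) / 16

print(np.allclose(g_full, g_accum))  # True
```

Accumulation trades memory for wall-clock time: each optimizer step now costs 16 forward/backward passes, but only one example's activations are resident at once.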

#### Run training on SageMaker.

```bash
accelerate launch --config_file accelerate_sagemaker.yaml \
  src/train_unconditional.py \
  --dataset_name teticio/audio-diffusion-256 \
  --resolution 256 \
  --output_dir ddpm-ema-audio-256 \
  --train_batch_size 16 \
  --num_epochs 100 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no
```