Riffusion Fine-Tune
This is a fine-tuned version of Riffusion, trained on bass samples extracted from the NSynth dataset. The purpose of this work is to evaluate how well the model can generate bass audio samples.
Notes
- This is the approach I found to achieve this goal; if you have a better idea for doing this, please share it with me.
Quickstart Guide
Clone the Riffusion repository and install the dependencies from its requirements.txt file: Riffusion Github
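For example, something like this should work (the clone URL is my assumption based on the project name; use the link above if the repository has moved):

git clone https://github.com/riffusion/riffusion.git
cd riffusion
pip install -r requirements.txt

Then you can load the fine-tuned pipeline with Diffusers: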
import torch
from diffusers import DiffusionPipeline

# Load the fine-tuned model; float16 weights are intended for GPU inference
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16).to(device)

# Generate a spectrogram image from a text prompt
prompt = "Your desired prompt"
image = pipe(prompt).images[0]
After that, the generated spectrogram is stored in image. To convert this image into an audio file, you can use the SpectrogramImageConverter class from the riffusion.spectrogram_image_converter module in the Riffusion repo.
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Reconstruct the waveform from the generated spectrogram image
params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
audio = converter.audio_from_spectrogram_image(image)
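In the Riffusion repo this method returns a pydub AudioSegment, so you can write the result straight to disk. A minimal sketch, with an arbitrary output filename:

# Save the reconstructed audio as a WAV file
audio.export("generated_bass.wav", format="wav")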
Fine Tuning
For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: NSynth Dataset
You can find the pre-processed files in my repo here: DaveLoay/NSynth_Bass_Captions
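If you want to inspect the pre-processed data before training, here is a minimal sketch using the datasets library (I'm assuming the default train split and that the dataset follows the image/caption column layout expected by train_text_to_image):

from datasets import load_dataset

# Download the pre-processed spectrogram/caption pairs from the Hub
dataset = load_dataset("DaveLoay/NSynth_Bass_Captions", split="train")

# Each example should pair a spectrogram image with a text caption
print(dataset)
print(dataset[0])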
As mentioned in the official Riffusion HF repo, I used the train_text_to_image script contained in the Diffusers repo, which you can check out here: Diffusers Repo
After configuring all dependencies, I used the following command to train the model:
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
--dataset_name=DaveLoay/NSynth_Bass_Captions \
--resolution=512 \
--use_ema \
--train_batch_size=3 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=4000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="Riffusion_FT_Bass_512_4000" \
--push_to_hub
Hardware
The hardware I used to fine-tune this model is:
- NVIDIA A100 with 40 GB of VRAM, hosted on Google Colab Pro
Training took about 3 hours and used roughly 26 GB of VRAM.
Credits
You can check the original repositories here:
- Riffusion
- NSynth Dataset
- Diffusers