Riffusion Fine-Tune
This is a fine-tuned version of Riffusion, trained on bass samples extracted from the NSynth dataset. The purpose of this work is to evaluate how well the model can generate bass audio samples.
Notes
- This is the approach I found to achieve this goal; if you have a better idea for doing this, please share it with me.
Quickstart Guide
Clone the Riffusion repository and install the dependencies from its requirements.txt file: Riffusion Github
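For example, something like this should work (the clone URL is my assumption based on the project name; use the link above if the repository has moved):

git clone https://github.com/riffusion/riffusion.git
cd riffusion
pip install -r requirements.txt

Then you can load the fine-tuned pipeline with Diffusers: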
import torch
from diffusers import DiffusionPipeline

# Load the fine-tuned model; float16 weights are intended for GPU inference
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16).to(device)

# Generate a spectrogram image from a text prompt
prompt = "Your desired prompt"
image = pipe(prompt).images[0]
After that, the generated spectrogram is stored in image. To convert this image into an audio file, you can use the SpectrogramImageConverter class from the riffusion.spectrogram_image_converter module in the Riffusion repo.
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Reconstruct the waveform from the generated spectrogram image
params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
audio = converter.audio_from_spectrogram_image(image)
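In the Riffusion repo this method returns a pydub AudioSegment, so you can write the result straight to disk. A minimal sketch, with an arbitrary output filename:

# Save the reconstructed audio as a WAV file
audio.export("generated_bass.wav", format="wav")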
Fine Tuning
For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: NSynth Dataset
You can find the pre-processed files in my repo here: DaveLoay/NSynth_Bass_Captions
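If you want to inspect the pre-processed data before training, here is a minimal sketch using the datasets library (I'm assuming the default train split and that the dataset follows the image/caption column layout expected by train_text_to_image):

from datasets import load_dataset

# Download the pre-processed spectrogram/caption pairs from the Hub
dataset = load_dataset("DaveLoay/NSynth_Bass_Captions", split="train")

# Each example should pair a spectrogram image with a text caption
print(dataset)
print(dataset[0])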
As mentioned in the official Riffusion HF repo, I used the train_text_to_image script contained in the Diffusers repo, which you can check out here: Diffusers Repo
After configuring all dependencies, I used the following command to train the model:
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
--dataset_name=DaveLoay/NSynth_Bass_Captions \
--resolution=512 \
--use_ema \
--train_batch_size=3 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=4000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="Riffusion_FT_Bass_512_4000" \
--push_to_hub
Hardware
The hardware I used to fine-tune this model is:
- NVIDIA A100 with 40 GB of VRAM, hosted on Google Colab Pro
Training took about 3 hours and used roughly 26 GB of VRAM.
Credits
You can check the original repositories here:
- Riffusion
- NSynth Dataset
- Diffusers