---
license: mit
language:
- en
library_name: peft
---
# Master Thesis: High-Fidelity Video Background Music Generation using Transformers
This is the corresponding GitLab repository of my Master Thesis. The goal of this thesis is to generate video background music by adapting MusicGen (https://arxiv.org/pdf/2306.05284.pdf) to accept video input as an additional input modality. This is accomplished by mapping video information into the T5 text embedding space on which MusicGen usually operates. To this end, a Transformer Encoder network, called the Video Encoder, is trained for this task (see the sketch after the list below). Two options are foreseen within the training loop for the Video Encoder:
- freezing the weights within the MusicGen Audio Decoder
- adjusting the weights of the MusicGen Audio Decoder with Parameter Efficient Fine-Tuning (PEFT) using LoRA (https://arxiv.org/abs/2106.09685)
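To illustrate the mapping idea, here is a minimal sketch of such a Video Encoder in PyTorch. The class name, feature dimensions, layer counts, and the assumption that per-frame video features have already been extracted by some visual backbone are all illustrative choices, not the exact architecture used in this thesis.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Illustrative sketch: maps per-frame video features into the T5
    embedding space that MusicGen conditions on. Dimensions and depth
    are assumptions, not the thesis configuration."""

    def __init__(self, video_feat_dim=512, t5_dim=768, num_layers=4, num_heads=8):
        super().__init__()
        # project raw video features to the T5 embedding dimension
        self.input_proj = nn.Linear(video_feat_dim, t5_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=t5_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, video_feats):
        # video_feats: (batch, num_frames, video_feat_dim)
        x = self.input_proj(video_feats)
        # output: (batch, num_frames, t5_dim), consumed like T5 text embeddings
        return self.encoder(x)


if __name__ == "__main__":
    # e.g. 16 frames of 512-dim features for a single video
    feats = torch.randn(1, 16, 512)
    print(VideoEncoder()(feats).shape)  # torch.Size([1, 16, 768])
```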
## Installation
- create a Python virtual environment with `Python 3.11`
- check https://pytorch.org/get-started/previous-versions/ to install `PyTorch 2.1.0` with `CUDA` on your machine
- install the local fork of audiocraft: `cd audiocraft; pip install -e .`
- install the other requirements: `pip install -r requirements.txt`
## Folder Structure
- `audiocraft` contains a local fork of the audiocraft library (https://github.com/facebookresearch/audiocraft) with minor changes to the generation method; further information can be found in `code/code_adaptations_audiocraft`.
- `code` contains the code for model `training` and `inference` of video background music
- `datasets` contains the code to create the datasets used for training within `data_preparation` and video examples used for the evaluation in `example_videos`
- `evaluation` contains the code used to evaluate the datasets and the created video embeddings
- `gradio_app` contains the code for the interface to generate video background music
## Training
To train the models, set the training parameters in `training/training_conf.yml` and start training with `python training/training.py`. The model weights will be stored under `training/models_audiocraft` or `training/models_peft`, respectively.
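For the PEFT option, the MusicGen audio decoder can be wrapped with LoRA adapters via the `peft` library. The following is only a rough sketch under assumed hyperparameters and module names; the actual rank, target modules, and configuration are defined by `training/training_conf.yml` and may differ.

```python
from audiocraft.models import MusicGen
from peft import LoraConfig, get_peft_model

# load a pretrained MusicGen checkpoint (the model size here is an assumption)
musicgen = MusicGen.get_pretrained("facebook/musicgen-small")

# LoRA configuration; rank, alpha, and target module names are illustrative
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["out_proj", "linear1", "linear2"],  # assumed layer names
)

# wrap the autoregressive language model (audio decoder) with LoRA adapters;
# only the adapter weights remain trainable
musicgen.lm = get_peft_model(musicgen.lm, lora_config)
musicgen.lm.print_trainable_parameters()
```

With the weight-freezing option, the decoder parameters would instead simply have `requires_grad` set to `False`, so that only the Video Encoder is updated during training.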
## Inference
- start the user interface by running `python gradio_app/app.py` (a minimal sketch of such an app follows below)
- inside the interface, select a video and set the generation parameters
- click on "submit" to start the generation
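For orientation, here is a minimal, hypothetical sketch of how such a Gradio interface could be wired together; the actual `gradio_app/app.py` likely exposes more parameters and calls the thesis-specific generation code instead of the placeholder function used here.

```python
import gradio as gr

def generate_background_music(video_path, duration):
    """Placeholder for the actual pipeline: extract video features, run the
    Video Encoder, and let MusicGen synthesize audio of the requested length."""
    # ... call into the thesis code here and return a path to the generated audio
    raise NotImplementedError

demo = gr.Interface(
    fn=generate_background_music,
    inputs=[
        gr.Video(label="Input video"),
        gr.Slider(5, 30, value=10, step=1, label="Music duration (s)"),
    ],
    outputs=gr.Audio(label="Generated background music"),
    title="Video Background Music Generation",
)

if __name__ == "__main__":
    demo.launch()
```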
## Contact
For any questions, contact me at [email protected]