|
---
license: apache-2.0
datasets:
- amaai-lab/MidiCaps
tags:
- music
- text-to-music
- symbolic-music
---
|
|
|
<div align="center"> |
|
|
|
# Text2midi: Generating Symbolic Music from Captions |
|
|
|
[Demo](https://huggingface.co/spaces/amaai-lab/text2midi) | [Model](https://huggingface.co/amaai-lab/text2midi) | [Github](https://github.com/AMAAI-Lab/Text2midi) | [Examples](https://aaaisubmission25.github.io/text2midi/) | [Paper](https://arxiv.org/abs/2412.16526) | [Dataset](https://huggingface.co/datasets/amaai-lab/MidiCaps) |
|
|
|
|
</div> |
|
|
|
**text2midi** is the first end-to-end model for generating MIDI files from textual descriptions. By leveraging pretrained large language models and a powerful autoregressive transformer decoder, **text2midi** allows users to create symbolic music that aligns with detailed textual prompts, including musical attributes like chords, tempo, and style. The details of the model are described in [this paper](https://arxiv.org/abs/2412.16526). |
|
|
|
🔥 Live demo available on [HuggingFace Spaces](https://huggingface.co/spaces/amaai-lab/text2midi).
|
|
|
<div align="center"> |
|
<img src="text2midi_architecture.jpg" width="500"/> |
|
</div> |
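As the figure shows, a pretrained FLAN-T5 encoder embeds the caption, and an autoregressive transformer decoder generates REMI tokens conditioned on that embedding via cross-attention. The snippet below is a minimal conceptual sketch of this wiring, not the released implementation; the decoder dimensions and layer count here are placeholders:

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel, T5Tokenizer

# Caption encoder: a pretrained T5 encoder produces the conditioning memory.
tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
enc = T5EncoderModel.from_pretrained("google/flan-t5-base")
caption = tok("A calm piano piece in C major", return_tensors="pt")
memory = enc(**caption).last_hidden_state  # (1, src_len, 768)

# MIDI decoder: cross-attends over the caption memory (placeholder sizes).
layer = nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

tgt = torch.randn(1, 16, 768)  # stand-in for an embedded REMI-token prefix
hidden = decoder(tgt=tgt, memory=memory)  # next-token hidden states
print(hidden.shape)  # torch.Size([1, 16, 768])
```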
|
|
|
## Quickstart Guide |
|
|
|
Generate symbolic music from a text prompt: |
|
|
|
```python |
|
import pickle

import torch
from huggingface_hub import hf_hub_download
from transformers import T5Tokenizer

from model.transformer_model import Transformer
|
|
|
repo_id = "amaai-lab/text2midi" |
|
# Download the model.bin file |
|
model_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin") |
|
# Download the vocab_remi.pkl file |
|
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="vocab_remi.pkl") |
|
|
|
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
|
|
|
print(f"Using device: {device}") |
|
|
|
# Load the tokenizer dictionary |
|
with open(tokenizer_path, "rb") as f:
    r_tokenizer = pickle.load(f)
|
|
|
# Get the vocab size |
|
vocab_size = len(r_tokenizer) |
|
print("Vocab size: ", vocab_size) |
|
# Instantiate the decoder with the released checkpoint's hyperparameters
model = Transformer(vocab_size, 768, 8, 2048, 18, 1024, False, 8, device=device)
|
model.load_state_dict(torch.load(model_path, map_location=device)) |
|
model.eval() |
|
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base") |
|
|
|
print('Model loaded.') |
|
|
|
|
|
# Enter the text prompt and tokenize it |
|
src = "A melodic electronic song with ambient elements, featuring piano, acoustic guitar, alto saxophone, string ensemble, and electric bass. Set in G minor with a 4/4 time signature, it moves at a lively Presto tempo. The composition evokes a blend of relaxation and darkness, with hints of happiness and a meditative quality." |
|
print('Generating for prompt: ' + src) |
|
|
|
inputs = tokenizer(src, return_tensors='pt', padding=True, truncation=True)
input_ids = inputs.input_ids.to(device)
attention_mask = inputs.attention_mask.to(device)
|
|
|
# Autoregressively generate a REMI token sequence, then decode it to MIDI
output = model.generate(input_ids, attention_mask, max_len=2000, temperature=1.0)
output_list = output[0].tolist()
generated_midi = r_tokenizer.decode(output_list)
generated_midi.dump_midi("output.mid")
|
``` |
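After generation, you can sanity-check the output file. The snippet below is a minimal sketch assuming the third-party `pretty_midi` package is installed (it is not part of this repo's requirements):

```python
import pretty_midi

# Load the generated file and print a quick summary.
midi = pretty_midi.PrettyMIDI("output.mid")
print(f"Duration: {midi.get_end_time():.1f} s")
for inst in midi.instruments:
    name = pretty_midi.program_to_instrument_name(inst.program)
    print(f"{name}: {len(inst.notes)} notes")
```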
|
|
|
## Installation |
|
If you have a CUDA-supported machine:
|
```bash |
|
git clone https://github.com/AMAAI-Lab/text2midi |
|
cd text2midi |
|
pip install -r requirements.txt |
|
``` |
|
Alternatively, if you have an MPS-supported machine (Apple Silicon):
|
```bash |
|
git clone https://github.com/AMAAI-Lab/text2midi |
|
cd text2midi |
|
pip install -r requirements-mac.txt |
|
``` |
|
|
|
## Datasets |
|
|
|
The model was trained on two datasets: [SymphonyNet](https://symphonynet.github.io/) for semi-supervised pretraining, and MidiCaps for fine-tuning on caption-to-MIDI generation.
|
The [MidiCaps dataset](https://huggingface.co/datasets/amaai-lab/MidiCaps) is a large-scale dataset of 168k MIDI files paired with rich text captions. These captions contain musical attributes such as key, tempo, style, and mood, making it ideal for text-to-MIDI generation tasks as described in [this paper](https://arxiv.org/abs/2406.02255). |
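To browse the dataset, here is a minimal sketch using the Hugging Face `datasets` library (the `train` split name is an assumption; check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Stream to avoid downloading the full 168k-file dataset up front;
# the "train" split name is an assumption.
ds = load_dataset("amaai-lab/MidiCaps", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())  # inspect the schema rather than assuming field names
```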
|
|
|
|
|
## Inference |
|
|
|
We support inference on CUDA, MPS, and CPU. Make sure you have installed the requirements file matching your hardware (`requirements.txt` for CUDA, `requirements-mac.txt` for MPS):
|
```bash |
|
python model/transformer_model.py --caption "<your caption>"
|
``` |
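For example, with an illustrative caption:

```bash
python model/transformer_model.py --caption "A cheerful jazz piece with piano and drums, set in C major at a moderate tempo"
```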
|
|
|
## Citation |
|
|
|
If you use text2midi in your research, please cite: |
|
``` |
|
@inproceedings{bhandari2025text2midi, |
|
title={text2midi: Generating Symbolic Music from Captions}, |
|
author={Keshav Bhandari and Abhinaba Roy and Kyra Wang and Geeta Puri and Simon Colton and Dorien Herremans}, |
|
booktitle={Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
## Results of the Listening Study |
|
|
|
Each question is rated on a Likert scale from 1 (very bad) to 7 (very good). The table shows the average rating per question for each source of music (ground-truth MidiCaps files, text2midi, and MuseCoco).
|
|
|
| Question         | MidiCaps | text2midi | MuseCoco |
|------------------|----------|-----------|----------|
| Musical Quality  | 5.79     | 4.62      | 4.40     |
| Overall Matching | 5.42     | 4.67      | 4.07     |
| Genre Matching   | 5.54     | 4.98      | 4.40     |
| Mood Matching    | 5.70     | 5.00      | 4.32     |
| Key Matching     | 4.61     | 3.64      | 3.36     |
| Chord Matching   | 3.20     | 2.50      | 2.00     |
| Tempo Matching   | 5.89     | 5.42      | 4.94     |
|
|
|
|
|
## Objective Evaluations |
|
|
|
| Metric     | text2midi | MidiCaps | MuseCoco |
|------------|-----------|----------|----------|
| CR ↑       | 2.31      | 3.43     | 2.12     |
| CLAP ↑     | 0.22      | 0.26     | 0.21     |
| TB (%) ↑   | 39.70     | -        | 21.71    |
| TBT (%) ↑  | 65.80     | -        | 54.63    |
| CK (%) ↑   | 33.60     | -        | 13.70    |
| CKD (%) ↑  | 35.60     | -        | 14.59    |
|
|
|
**Note**: CR = Compression Ratio; CLAP = CLAP score; TB = Tempo Bin; TBT = Tempo Bin with Tolerance; CK = Correct Key; CKD = Correct Key with Duplicates. ↑ = higher score is better.
|
|
|
## Training |
|
To train text2midi, we recommend using `accelerate` for multi-GPU support. First, configure it by running:
|
```bash |
|
accelerate config |
|
``` |
|
|
|
Then, use the following command to start training: |
|
```bash |
|
accelerate launch train.py \
  --encoder_model="google/flan-t5-large" \
  --decoder_model="configs/transformer_decoder_config.json" \
  --dataset_name="amaai-lab/MidiCaps" \
  --pretrain_dataset="amaai-lab/SymphonyNet" \
  --batch_size=16 \
  --learning_rate=1e-4 \
  --epochs=40
|
``` |
|
|
|
|
|
|
|
|