---
license: apache-2.0
datasets:
- amaai-lab/MidiCaps
tags:
- music
- text-to-music
- symbolic-music
---
<div align="center">
# Text2midi: Generating Symbolic Music from Captions
[Demo](https://huggingface.co/spaces/amaai-lab/text2midi) | [Model](https://huggingface.co/amaai-lab/text2midi) | [Github](https://github.com/AMAAI-Lab/Text2midi) | [Examples](https://aaaisubmission25.github.io/text2midi/) | [Paper](https://arxiv.org/abs/2412.16526) | [Dataset](https://huggingface.co/datasets/amaai-lab/MidiCaps)
</div>
**text2midi** is the first end-to-end model for generating MIDI files from textual descriptions. By combining a pretrained large language model encoder (FLAN-T5) with an autoregressive transformer decoder, **text2midi** lets users create symbolic music that aligns with detailed textual prompts, including musical attributes such as chords, tempo, and style. The details of the model are described in [this paper](https://arxiv.org/abs/2412.16526).
🔥 Live demo available on [HuggingFace Spaces](https://huggingface.co/spaces/amaai-lab/text2midi).
<div align="center">
<img src="text2midi_architecture.jpg" width="500"/>
</div>
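Under the hood, the caption is embedded by the pretrained FLAN-T5 encoder, and the decoder cross-attends to those embeddings while autoregressively emitting REMI MIDI tokens. The minimal sketch below reproduces only the text-encoding half using the public `transformers` API; the decoder call itself is shown in the Quickstart.

```python
# Sketch of the text-encoding half of the pipeline; the decoder ships
# with this repo (see the Quickstart below).
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-base").eval()

caption = "A calm piano piece in C major at a slow tempo."
enc = tokenizer(caption, return_tensors="pt")
with torch.no_grad():
    caption_states = encoder(**enc).last_hidden_state  # (1, seq_len, 768)

# The autoregressive decoder cross-attends to `caption_states` while
# emitting REMI MIDI tokens one step at a time.
print(caption_states.shape)
```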
## Quickstart Guide
Generate symbolic music from a text prompt:
```python
import pickle
import torch
from transformers import T5Tokenizer
from model.transformer_model import Transformer
from huggingface_hub import hf_hub_download
repo_id = "amaai-lab/text2midi"
# Download the model weights (pytorch_model.bin)
model_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
# Download the REMI tokenizer vocabulary (vocab_remi.pkl)
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="vocab_remi.pkl")
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
print(f"Using device: {device}")
# Load the tokenizer dictionary
with open(tokenizer_path, "rb") as f:
    r_tokenizer = pickle.load(f)
# Get the vocab size
vocab_size = len(r_tokenizer)
print("Vocab size: ", vocab_size)
# Build the decoder with the released checkpoint's hyperparameters
model = Transformer(vocab_size, 768, 8, 2048, 18, 1024, False, 8, device=device)
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval()
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
print('Model loaded.')
# Enter the text prompt and tokenize it
src = "A melodic electronic song with ambient elements, featuring piano, acoustic guitar, alto saxophone, string ensemble, and electric bass. Set in G minor with a 4/4 time signature, it moves at a lively Presto tempo. The composition evokes a blend of relaxation and darkness, with hints of happiness and a meditative quality."
print('Generating for prompt: ' + src)
inputs = tokenizer(src, return_tensors='pt', padding=True, truncation=True)
input_ids = inputs.input_ids.to(device)
attention_mask = inputs.attention_mask.to(device)
# Generate the MIDI token sequence
output = model.generate(input_ids, attention_mask, max_len=2000, temperature=1.0)
output_list = output[0].tolist()
generated_midi = r_tokenizer.decode(output_list)
generated_midi.dump_midi("output.mid")
```
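To sanity-check the result, `output.mid` can be reopened and inspected, for example with the `symusic` package (the same library behind `dump_midi` above; assumed installed):

```python
# Optional sanity check on the generated file (assumes `symusic`,
# the library behind dump_midi above, is installed).
from symusic import Score

score = Score("output.mid")
print(f"Tracks: {len(score.tracks)}")
print(f"Notes in first track: {len(score.tracks[0].notes)}")
```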
## Installation
If you have a CUDA-supported machine:
```bash
git clone https://github.com/AMAAI-Lab/text2midi
cd text2midi
pip install -r requirements.txt
```
Alternatively, if you have an MPS-supported machine (Apple Silicon):
```bash
git clone https://github.com/AMAAI-Lab/text2midi
cd text2midi
pip install -r requirements-mac.txt
```
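Either way, a quick check confirms which accelerator your PyTorch build can see:

```python
# Optional: verify which accelerator PyTorch detects after installation.
import torch
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())
```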
## Datasets
The model was trained using two datasets: [SymphonyNet](https://symphonynet.github.io/) for semi-supervised pretraining and MidiCaps for fine-tuning on caption-to-MIDI generation.
The [MidiCaps dataset](https://huggingface.co/datasets/amaai-lab/MidiCaps) is a large-scale dataset of 168k MIDI files paired with rich text captions. These captions contain musical attributes such as key, tempo, style, and mood, making it ideal for text-to-MIDI generation tasks as described in [this paper](https://arxiv.org/abs/2406.02255).
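To browse the captions before training or prompting, the dataset can be streamed from the Hub. A minimal sketch (the `caption` column name is an assumption here; check the dataset card for the exact schema):

```python
# Peek at one MidiCaps example (requires the `datasets` library; the
# `caption` column name is an assumption -- see the dataset card).
from datasets import load_dataset

ds = load_dataset("amaai-lab/MidiCaps", split="train", streaming=True)
example = next(iter(ds))
print(sorted(example.keys()))
print(example.get("caption"))
```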
## Inference
We support inference on CUDA, MPS, and CPU. Make sure you have installed the correct requirements file (`requirements.txt` for CUDA, `requirements-mac.txt` for MPS):
```bash
python model/transformer_model.py --caption "<your caption>"
```
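For example, with a concrete caption:

```bash
python model/transformer_model.py --caption "A cheerful folk tune led by acoustic guitar in D major, 3/4 time, at a moderate tempo."
```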
## Citation
If you use text2midi in your research, please cite:
```
@inproceedings{bhandari2025text2midi,
title={text2midi: Generating Symbolic Music from Captions},
author={Keshav Bhandari and Abhinaba Roy and Kyra Wang and Geeta Puri and Simon Colton and Dorien Herremans},
booktitle={Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)},
year={2025}
}
```
## Results of the Listening Study
Each question was rated on a Likert scale from 1 (very bad) to 7 (very good). The table shows the average rating per question for each system: ground-truth MidiCaps files, text2midi, and MuseCoco.
| Question | MidiCaps | text2midi | MuseCoco |
|---------------------|----------|-----------|----------|
| Musical Quality | 5.79 | 4.62 | 4.40 |
| Overall Matching | 5.42 | 4.67 | 4.07 |
| Genre Matching | 5.54 | 4.98 | 4.40 |
| Mood Matching | 5.70 | 5.00 | 4.32 |
| Key Matching | 4.61 | 3.64 | 3.36 |
| Chord Matching | 3.20 | 2.50 | 2.00 |
| Tempo Matching | 5.89 | 5.42 | 4.94 |
## Objective Evaluations
| Metric | text2midi | MidiCaps | MuseCoco |
|---------------------|-----------|----------|----------|
| CR ↑ | 2.31 | 3.43 | 2.12 |
| CLAP ↑ | 0.22 | 0.26 | 0.21 |
| TB (%) ↑ | 39.70 | - | 21.71 |
| TBT (%) ↑ | 65.80 | - | 54.63 |
| CK (%) ↑ | 33.60 | - | 13.70 |
| CKD (%) ↑ | 35.60 | - | 14.59 |
**Note**:
- CR = Compression Ratio
- CLAP = CLAP score
- TB = Tempo Bin
- TBT = Tempo Bin with Tolerance
- CK = Correct Key
- CKD = Correct Key with Duplicates
- ↑ = higher score is better
## Training
To train text2midi, we recommend using accelerate for multi-GPU support. First, configure accelerate by running:
```bash
accelerate config
```
Then, use the following command to start training:
```bash
accelerate launch train.py \
--encoder_model="google/flan-t5-large" \
--decoder_model="configs/transformer_decoder_config.json" \
--dataset_name="amaai-lab/MidiCaps" \
--pretrain_dataset="amaai-lab/SymphonyNet" \
--batch_size=16 \
--learning_rate=1e-4 \
  --epochs=40
```
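`accelerate launch` picks up the saved configuration automatically; the process count can also be overridden inline, e.g. for a four-GPU run:

```bash
# Override the saved config and train on 4 GPUs (remaining flags as above).
accelerate launch --num_processes=4 train.py --encoder_model="google/flan-t5-large" ...
```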