Update README.md
README.md
CHANGED
@@ -102,129 +102,158 @@ pipeline_tag: feature-extraction
tags:
- music
---
# **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**

<p align="center">
  <img src="overview.png" alt="CLaMP 3 Overview" width="50%">
</p>

## **Overview**
CLaMP 3 is a **multimodal and multilingual** framework for **music information retrieval (MIR)**. By using **contrastive learning**, it aligns **sheet music, audio, performance signals, and multilingual text** into a **shared representation space**, enabling retrieval across unaligned musical modalities.
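
To make the idea of contrastive alignment more concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of paired music and text embeddings. It is an illustration under stated assumptions, not CLaMP 3's training code: the function name, the temperature value, and the random toy embeddings are all hypothetical.

```python
import numpy as np

def symmetric_contrastive_loss(music_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Row i of music_emb is assumed to be paired with row i of text_emb.
    The temperature default is illustrative, not taken from CLaMP 3.
    """
    # L2-normalize so that dot products are cosine similarities.
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diagonal(scores):
        # Row-wise log-softmax, then pick the matching (diagonal) entries.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the music-to-text and text-to-music directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.01 * rng.normal(size=(4, 8))  # music embeddings close to their text
shuffled = aligned[::-1]                         # pairs deliberately mismatched
print(symmetric_contrastive_loss(aligned, text))
print(symmetric_contrastive_loss(shuffled, text))
```

Correctly paired embeddings should give a much lower loss than mismatched ones, which is the pressure that pulls matching items from different modalities together in the shared space.
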

### **Key Features**
- **Multimodal Support:**
  - **Sheet Music:** Uses **Interleaved ABC notation**, with a context size of **512 bars**.
  - **Performance Signals:** Processes **MIDI Text Format (MTF)** data, with a context size of **512 MIDI messages**.
  - **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.

- **Multilingual Capabilities:**
  - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.

- **Datasets & Benchmarking:**
  - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A **large-scale** dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
  - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.

### **Applications**
CLaMP 3 supports a **wide range of music research tasks**, including but not limited to:
- **Semantic Retrieval:** Find music based on **descriptions** or retrieve textual metadata for **audio or symbolic** inputs.
- **Zero-Shot Classification:** Categorize **music by genre, region, or other attributes** without labeled training data.
- **Music Quality Assessment:** Compute the **semantic distance** between reference and generated music features, similar to **Fréchet Inception Distance (FID)**.
- **Cross-Modal Generative Model Evaluation:** Assess **text-to-music generation, music captioning**, and **symbolic-to-audio synthesis** models.
- **Computational Musicology:** By visualizing the distribution of data within the **shared representation space**, researchers can explore regional music patterns, stylistic similarities, and cross-cultural influences.

Importantly, these applications are **not restricted to any specific music modality or language**, making CLaMP 3 a powerful tool for **diverse music AI research**.
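
As an illustration of the FID-style quality assessment mentioned above, here is a hedged NumPy sketch of a Fréchet distance between two sets of feature vectors. The function name and the toy Gaussian data are assumptions made for this example; the repository may compute its metric differently.

```python
import numpy as np

def frechet_distance(ref_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is an (n_samples, dim) array with dim >= 2, for example
    global feature vectors of reference and generated music.
    """
    mu_r, mu_g = ref_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(ref_feats, rowvar=False))
    cov_g = np.atleast_2d(np.cov(gen_feats, rowvar=False))
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g. These are real and non-negative for
    # covariance matrices; tiny negative or imaginary parts are numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
similar = rng.normal(size=(500, 4))            # same distribution as the reference
shifted = rng.normal(loc=2.0, size=(500, 4))   # mean moved away from the reference
print(frechet_distance(reference, reference))  # approximately 0
print(frechet_distance(reference, similar))
print(frechet_distance(reference, shifted))
```

A set drawn from the same distribution as the reference scores close to zero, while the shifted set scores much higher, mirroring how FID penalizes generated samples that drift from the reference distribution.
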

## **Links**
- **CLaMP 3 Demo Page** *(Coming Soon...)*
- **CLaMP 3 Paper** *(Coming Soon...)*
- **[CLaMP 3 Code](https://github.com/sanderwood/clamp3)**
- **[CLaMP 3 Model Weights](https://huggingface.co/sander-wood/clamp3/tree/main)**
- **[M4-RAG Pre-training Dataset](https://huggingface.co/datasets/sander-wood/m4-rag)**
- **[WikiMT-X Evaluation Benchmark](https://huggingface.co/datasets/sander-wood/wikimt-x)**

> **Note:** Ensure the model weights are placed in the `code/` folder, and verify the **configuration hyperparameters** before use.

## **Repository Structure**
- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.
- **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Convert data into **Interleaved ABC, MTF, or MERT-extracted features**.
- **[retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval)** → Semantic search, retrieval evaluation, and similarity calculations.

## **Getting Started**
### **Environment Setup**
To set up the environment for CLaMP 3, run:
```bash
conda env create -f environment.yml
conda activate clamp3
```

### **Data Preparation**
#### **1. Convert Music Data to Compatible Formats**
Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.

> **Note:** Each script requires a manual edit of the `input_dir` variable at the top of the file before running, **except for the MERT extraction script (`extract_mert.py`), which takes command-line arguments for input and output paths.**

##### **1.1 Convert MusicXML to Interleaved ABC Notation**

CLaMP 3 requires **Interleaved ABC notation** for sheet music. First, convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):

```bash
python batch_xml2abc.py
```
- **Input:** `.mxl`, `.xml`, `.musicxml`
- **Output:** `.abc` (Standard ABC)

Next, process the standard ABC files into **Interleaved ABC notation** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):

```bash
python batch_interleaved_abc.py
```
- **Input:** `.abc` (Standard ABC)
- **Output:** `.abc` *(Interleaved ABC for CLaMP 3)*

##### **1.2 Convert MIDI to MTF Format**
CLaMP 3 processes **performance signals** in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):

```bash
python batch_midi2mtf.py
```
- **Input:** `.mid`, `.midi`
- **Output:** `.mtf` *(MTF for CLaMP 3)*

##### **1.3 Extract Audio Features using MERT**
For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract **MERT-based features** from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):

```bash
python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path musichubert_hf/MERT-v1-95M --mean_features
```
- **Input:** `.mp3`, `.wav`
- **Output:** `.npy` *(Processed audio features for CLaMP 3)*
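
The extension-to-script mapping described in steps 1.1 to 1.3 can be summarized in a small helper. This is only a planning sketch: `PREPROCESSING_STEPS` and `plan_preprocessing` are hypothetical names that do not exist in the repository, and the helper merely reports which script applies to each file rather than running anything.

```python
from pathlib import Path

# Extension-to-step mapping taken from the sections above. The dictionary
# and the helper below are hypothetical and not part of the repository.
PREPROCESSING_STEPS = {
    ".mxl": "batch_xml2abc.py, then batch_interleaved_abc.py",
    ".xml": "batch_xml2abc.py, then batch_interleaved_abc.py",
    ".musicxml": "batch_xml2abc.py, then batch_interleaved_abc.py",
    ".mid": "batch_midi2mtf.py",
    ".midi": "batch_midi2mtf.py",
    ".mp3": "extract_mert.py",
    ".wav": "extract_mert.py",
}

def plan_preprocessing(paths):
    """Group input files by the preprocessing step their extension calls for."""
    plan = {}
    for path in map(Path, paths):
        # Compare extensions case-insensitively; unknown ones are flagged.
        step = PREPROCESSING_STEPS.get(path.suffix.lower(), "unsupported")
        plan.setdefault(step, []).append(path.name)
    return plan

files = ["scores/prelude.musicxml", "midi/etude.MID", "audio/take1.wav", "notes/readme.pdf"]
for step, names in plan_preprocessing(files).items():
    print(f"{step}: {names}")
```

Sorting a mixed collection this way before editing `input_dir` in each script makes it easier to see which directories need which conversion.
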

### **Training and Feature Extraction**
#### **1. Training Models**
Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters and data paths**.

To train CLaMP 3 on **symbolic music**, use **[train_clamp3_symbolic.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_symbolic.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
```

For **audio data**, use **[train_clamp3_audio.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_audio.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
```

Alternatively, you can use **pre-trained weights**:
- **[CLaMP 3 SAAS (Optimal for Audio)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas.pth)**
- **[CLaMP 3 C2 (Optimal for Symbolic Music)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2.pth)**

By default, CLaMP 3 is configured for the **SAAS version**, which provides **optimal performance on audio data**. If working primarily with **symbolic music**, download the **C2 variant** and modify **line 66 in `config.py`** from **saas** to **c2**.

#### **2. Feature Extraction**
After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):

```bash
accelerate launch extract_clamp3.py [--epoch <epoch>] <input_dir> <output_dir> [--get_global]
```
- **`--epoch <epoch>`:** (Optional) Specify the checkpoint epoch.
- **`<input_dir>`:** Directory containing the input files.
- **`<output_dir>`:** Destination folder for the output `.npy` features.
- **`--get_global`:** (Optional) Flag to extract a global semantic vector for each input.

All extracted features are stored as `.npy` files.

> **Note**: In this project, we use the global semantic vectors (via average pooling and a linear layer) for both classification and retrieval tasks.
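
The pooling step in the note above can be sketched as follows. This is an assumption-laden illustration: the array shapes, the random weights, and the function name `global_semantic_vector` are made up for the example, and the actual model may handle padding or masking differently.

```python
import numpy as np

def global_semantic_vector(token_features, weight, bias):
    """Average-pool per-position features, then apply a linear layer.

    token_features: (seq_len, hidden_dim) array of per-position features.
    weight, bias: parameters of a linear layer. They are random here for
    illustration; in practice they would come from the trained model.
    """
    pooled = token_features.mean(axis=0)  # (hidden_dim,)
    return pooled @ weight + bias         # (out_dim,)

rng = np.random.default_rng(0)
features = rng.normal(size=(12, 16))  # 12 positions, hidden size 16 (arbitrary)
W = rng.normal(size=(16, 8))
b = np.zeros(8)
vector = global_semantic_vector(features, W, b)
print(vector.shape)  # (8,)
```

Collapsing a variable-length sequence into one fixed-size vector is what makes inputs of different lengths and modalities directly comparable in the retrieval and classification steps.
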

### **Retrieval and Classification**
#### **1. Semantic Search**
Retrieve **similar music features** using **[`semantic_search.py`](https://github.com/sanderwood/clamp3/tree/main/retrieval/semantic_search.py)**:
```bash
python semantic_search.py <query_file> <reference_folder> [--top_k TOP_K]
```
> **Note:** Zero-shot classification is essentially **semantic search**, where the query feature is compared against class prototypes.
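
To show how zero-shot classification reduces to semantic search, here is a minimal sketch that ranks reference vectors by cosine similarity to a query. The use of cosine similarity, the function name `top_k_matches`, and the toy three-dimensional prototypes are assumptions for illustration; this is not the code of `semantic_search.py`.

```python
import numpy as np

def top_k_matches(query, references, k=3):
    """Rank reference vectors by cosine similarity to the query vector.

    references: dict mapping a name (a file stem or a class label) to a 1-D vector.
    Returns the k best (name, similarity) pairs, best first.
    """
    names = list(references)
    matrix = np.stack([references[name] for name in names])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = matrix @ q
    order = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in order]

# Zero-shot classification as search: the "references" are class prototypes,
# for example text features of label descriptions (toy vectors here).
prototypes = {
    "folk": np.array([1.0, 0.1, 0.0]),
    "jazz": np.array([0.0, 1.0, 0.2]),
    "classical": np.array([0.1, 0.0, 1.0]),
}
query = np.array([0.9, 0.2, 0.1])
print(top_k_matches(query, prototypes, k=1)[0][0])  # folk
```

With real data, the query and prototypes would be global feature vectors loaded from the extracted `.npy` files rather than hand-written arrays.
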

#### **2. Classification**
Train a linear classifier using **[`train_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/train_cls.py)**:
```bash
python train_cls.py --train_folder <path> --eval_folder <path> [--num_epochs <int>] [--learning_rate <float>] [--balanced_training]
```
Run inference with **[`inference_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/inference_cls.py)**:
```bash
python inference_cls.py <weights_path> <feature_folder> <output_file>
```
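
For readers who want to see what a linear classifier on extracted features amounts to, the following is a small softmax-regression sketch in NumPy. It is not the implementation in `train_cls.py`: the function names, the hyperparameter defaults, and the synthetic two-cluster data are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def train_linear_classifier(features, labels, num_classes, num_epochs=200, learning_rate=0.5):
    """Fit softmax regression (a single linear layer) with full-batch gradient descent."""
    n, dim = features.shape
    weight = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(num_epochs):
        logits = features @ weight + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n                 # gradient of the mean cross-entropy
        weight -= learning_rate * features.T @ grad
        bias -= learning_rate * grad.sum(axis=0)
    return weight, bias

def predict(features, weight, bias):
    """Return the index of the highest-scoring class for each row."""
    return np.argmax(features @ weight + bias, axis=1)

rng = np.random.default_rng(0)
# Two well-separated toy clusters standing in for extracted feature vectors.
class0 = rng.normal(loc=-2.0, size=(50, 4))
class1 = rng.normal(loc=2.0, size=(50, 4))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)
W, b = train_linear_classifier(X, y, num_classes=2)
accuracy = (predict(X, W, b) == y).mean()
print(accuracy)
```

Because the classifier is linear, its accuracy mainly reflects how well the extracted features already separate the classes, which is why this kind of probe is a common way to evaluate representations.
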

## **Citation**
*Coming Soon...*