---
license: mit
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
- yue
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: feature-extraction
tags:
- music
---
|
# **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**

<p align="center">
  <img src="overview.png" alt="CLaMP 3 Overview" width="50%">
</p>
|
|
|
## **Overview**

CLaMP 3 is a **multimodal and multilingual** framework for **music information retrieval (MIR)**. Using **contrastive learning**, it aligns **sheet music, audio, performance signals, and multilingual text** into a **shared representation space**, enabling retrieval across unaligned musical modalities.
|
|
|
### **Key Features**

- **Multimodal Support:**
  - **Sheet Music:** Uses **Interleaved ABC notation**, with a context size of **512 bars**.
  - **Performance Signals:** Processes **MIDI Text Format (MTF)** data, with a context size of **512 MIDI messages**.
  - **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.

- **Multilingual Capabilities:**
  - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.

- **Datasets & Benchmarking:**
  - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A **large-scale** dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
  - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.
|
|
|
### **Applications**

CLaMP 3 supports a **wide range of music research tasks**, including but not limited to:

- **Semantic Retrieval:** Find music based on **descriptions** or retrieve textual metadata for **audio or symbolic** inputs.
- **Zero-Shot Classification:** Categorize **music by genre, region, or other attributes** without labeled training data.
- **Music Quality Assessment:** Compute the **semantic distance** between reference and generated music features, similar to **Fréchet Inception Distance (FID)**.
- **Cross-Modal Generative Model Evaluation:** Assess **text-to-music generation, music captioning**, and **symbolic-to-audio synthesis** models.
- **Computational Musicology:** By visualizing the distribution of data within the **shared representation space**, researchers can explore regional music patterns, stylistic similarities, and cross-cultural influences.

Importantly, these applications are **not restricted to any specific music modality or language**, making CLaMP 3 a powerful tool for **diverse music AI research**.
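As a concrete illustration of the FID-style quality assessment above, the Fréchet distance between two sets of extracted feature vectors can be computed by fitting a Gaussian to each set. This is a minimal NumPy/SciPy sketch; the function name is illustrative and not part of the CLaMP 3 codebase.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_ref, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets.

    feats_ref, feats_gen: arrays of shape (n_samples, dim), e.g. stacked
    global semantic vectors loaded from the extracted .npy files.
    """
    mu1, mu2 = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_ref, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

A distance near zero indicates that the generated features are distributed like the reference features; larger values indicate a larger semantic gap.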
|
|
|
## **Links**

- **CLaMP 3 Demo Page** *(Coming Soon...)*
- **CLaMP 3 Paper** *(Coming Soon...)*
- **[CLaMP 3 Code](https://github.com/sanderwood/clamp3)**
- **[CLaMP 3 Model Weights](https://huggingface.co/sander-wood/clamp3/tree/main)**
- **[M4-RAG Pre-training Dataset](https://huggingface.co/datasets/sander-wood/m4-rag)**
- **[WikiMT-X Evaluation Benchmark](https://huggingface.co/datasets/sander-wood/wikimt-x)**

> **Note:** Ensure the model weights are placed in the `code/` folder, and verify the **configuration hyperparameters** before use.
|
|
|
## **Repository Structure**

- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classifier training and prediction.
- **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Scripts to convert data into **Interleaved ABC, MTF, or MERT-extracted features**.
- **[retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval)** → Semantic search, retrieval evaluation, and similarity calculations.
|
|
|
## **Getting Started**

### **Environment Setup**

To set up the environment for CLaMP 3, run:

```bash
conda env create -f environment.yml
conda activate clamp3
```
|
|
|
### **Data Preparation**

#### **1. Convert Music Data to Compatible Formats**

Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.

> **Note:** Each script requires a manual edit of the `input_dir` variable at the top of the file before running, **except for the MERT extraction script (`extract_mert.py`), which takes command-line arguments for input and output paths.**
|
|
|
##### **1.1 Convert MusicXML to Interleaved ABC Notation**

CLaMP 3 requires **Interleaved ABC notation** for sheet music. First, convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):

```bash
python batch_xml2abc.py
```

- **Input:** `.mxl`, `.xml`, `.musicxml`
- **Output:** `.abc` *(Standard ABC)*

Next, convert the standard ABC files into **Interleaved ABC notation** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):

```bash
python batch_interleaved_abc.py
```

- **Input:** `.abc` *(Standard ABC)*
- **Output:** `.abc` *(Interleaved ABC for CLaMP 3)*
|
|
|
##### **1.2 Convert MIDI to MTF Format**

CLaMP 3 processes **performance signals** in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):

```bash
python batch_midi2mtf.py
```

- **Input:** `.mid`, `.midi`
- **Output:** `.mtf` *(MTF for CLaMP 3)*
|
|
|
##### **1.3 Extract Audio Features using MERT**

For audio, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract them from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):

```bash
python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path musichubert_hf/MERT-v1-95M --mean_features
```

- **Input:** `.mp3`, `.wav`
- **Output:** `.npy` *(Processed audio features for CLaMP 3)*
|
|
|
### **Training and Feature Extraction**

#### **1. Training Models**

Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters and data paths**.

To train CLaMP 3 on **symbolic music**, use **[train_clamp3_symbolic.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_symbolic.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
```

For **audio data**, use **[train_clamp3_audio.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_audio.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
```

Alternatively, you can use **pre-trained weights**:

- **[CLaMP 3 SAAS (Optimal for Audio)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas.pth)**
- **[CLaMP 3 C2 (Optimal for Symbolic Music)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2.pth)**

By default, CLaMP 3 is configured for the **SAAS version**, which provides **optimal performance on audio data**. If you work primarily with **symbolic music**, download the **C2 variant** and change **line 66 of `config.py`** from **`saas`** to **`c2`**.
|
|
|
#### **2. Feature Extraction**

After training (or when using pre-trained weights), extract features with [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):

```bash
accelerate launch extract_clamp3.py --epoch <epoch> <input_dir> <output_dir> [--get_global]
```

- **`--epoch <epoch>`:** (Optional) Checkpoint epoch to load.
- **`<input_dir>`:** Directory containing the input files.
- **`<output_dir>`:** Destination folder for the output `.npy` features.
- **`--get_global`:** (Optional) Extract a global semantic vector for each input.

All extracted features are stored as `.npy` files.

> **Note:** In this project, the global semantic vectors (obtained via average pooling and a linear layer) are used for both classification and retrieval tasks.
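Once extracted, the `.npy` feature files can be compared directly with cosine similarity. The sketch below is illustrative only (the `cosine_rank` helper and flat directory layout are assumptions, not part of the repository); `semantic_search.py` is the canonical retrieval tool.

```python
import numpy as np
from pathlib import Path


def cosine_rank(query_path, reference_dir, top_k=5):
    """Rank reference .npy feature files by cosine similarity to a query.

    query_path: path to one extracted .npy feature file.
    reference_dir: directory containing reference .npy feature files.
    """
    query = np.load(query_path).squeeze()
    query = query / np.linalg.norm(query)
    scores = {}
    for ref in Path(reference_dir).glob("*.npy"):
        feat = np.load(ref).squeeze()
        scores[ref.name] = float(query @ (feat / np.linalg.norm(feat)))
    # Highest-similarity references first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```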
|
|
|
### **Retrieval and Classification**

#### **1. Semantic Search**

Retrieve **similar music features** with **[`semantic_search.py`](https://github.com/sanderwood/clamp3/tree/main/retrieval/semantic_search.py)**:

```bash
python semantic_search.py <query_file> <reference_folder> [--top_k TOP_K]
```

> **Note:** Zero-shot classification is essentially **semantic search**, where the query feature is compared against class prototypes.
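The prototype comparison described in the note can be sketched in a few lines. This is a hedged illustration: `zero_shot_classify` is a hypothetical helper, and the prototype vectors are assumed to be features extracted from class-name text prompts.

```python
import numpy as np


def zero_shot_classify(feature, prototypes):
    """Assign the label whose prototype vector is most cosine-similar.

    feature: extracted feature vector for one piece of music.
    prototypes: dict mapping label -> prototype vector (e.g. text features
    extracted from class-name prompts; names here are illustrative).
    """
    feature = feature / np.linalg.norm(feature)
    best_label, _ = max(
        prototypes.items(),
        key=lambda kv: float(feature @ (kv[1] / np.linalg.norm(kv[1]))),
    )
    return best_label
```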
|
|
|
#### **2. Classification**

Train a linear classifier with **[`train_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/train_cls.py)**:

```bash
python train_cls.py --train_folder <path> --eval_folder <path> [--num_epochs <int>] [--learning_rate <float>] [--balanced_training]
```

Run inference with **[`inference_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/inference_cls.py)**:

```bash
python inference_cls.py <weights_path> <feature_folder> <output_file>
```
|
|
|
## **Citation**

*Coming Soon...*