---
license: mit
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh  
- yue  
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: feature-extraction
tags:
- music
---
# **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**

<p align="center">
  <img src="overview.png" alt="CLaMP 3 Overview" width="50%">
</p>

## **Overview**
CLaMP 3 is a **multimodal and multilingual** framework for **music information retrieval (MIR)**. By using **contrastive learning**, it aligns **sheet music, audio, performance signals, and multilingual text** into a **shared representation space**, enabling retrieval across unaligned musical modalities.  

### **Key Features**  
- **Multimodal Support:**  
   - **Sheet Music:** Uses **Interleaved ABC notation**, with a context size of **512 bars**.  
   - **Performance Signals:** Processes **MIDI Text Format (MTF)** data, with a context size of **512 MIDI messages**.  
   - **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.

- **Multilingual Capabilities:**  
   - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.  

- **Datasets & Benchmarking:**  
   - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A **large-scale** dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.  
   - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.  

### **Applications**  
CLaMP 3 supports a **wide range of music research tasks**, including but not limited to:  
- **Semantic Retrieval:** Find music based on **descriptions** or retrieve textual metadata for **audio or symbolic** inputs.  
- **Zero-Shot Classification:** Categorize **music by genre, region, or other attributes** without labeled training data.  
- **Music Quality Assessment:** Compute the **semantic distance** between reference and generated music features, similar to **Fréchet Inception Distance (FID)**.  
- **Cross-Modal Generative Model Evaluation:** Assess **text-to-music generation, music captioning**, and **symbolic-to-audio synthesis** models.  
- **Computational Musicology:** By visualizing the distribution of data within the **shared representation space**, researchers can explore regional music patterns, stylistic similarities, and cross-cultural influences. 

Importantly, these applications are **not restricted to any specific music modality or language**, making CLaMP 3 a powerful tool for **diverse music AI research**.

## **Links**
- **CLaMP 3 Demo Page** *(Coming Soon...)*
- **CLaMP 3 Paper** *(Coming Soon...)*
- **[CLaMP 3 Code](https://github.com/sanderwood/clamp3)**
- **[CLaMP 3 Model Weights](https://huggingface.co/sander-wood/clamp3/tree/main)**
- **[M4-RAG Pre-training Dataset](https://huggingface.co/datasets/sander-wood/m4-rag)**
- **[WikiMT-X Evaluation Benchmark](https://huggingface.co/datasets/sander-wood/wikimt-x)**

> **Note:** Ensure the model weights are placed in the `code/` folder, and verify the **configuration hyperparameters** before use.

## **Repository Structure**
- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.  
- **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Convert data into **Interleaved ABC, MTF, or MERT-extracted features**.  
- **[retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval)** → Semantic search, retrieval evaluation, and similarity calculations.  

## **Getting Started**
### **Environment Setup**
To set up the environment for CLaMP 3, run:  
```bash
conda env create -f environment.yml
conda activate clamp3
```

### **Data Preparation**
#### **1. Convert Music Data to Compatible Formats**
Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.

> **Note:** Each script requires a manual edit of the `input_dir` variable at the top of the file before running, **except for the MERT extraction script (`extract_mert.py`), which takes command-line arguments for input and output paths.**

##### **1.1 Convert MusicXML to Interleaved ABC Notation**

CLaMP 3 requires **Interleaved ABC notation** for sheet music. To achieve this, first, convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):

```bash
python batch_xml2abc.py
```
- **Input:** `.mxl`, `.xml`, `.musicxml`  
- **Output:** `.abc` (Standard ABC)
 
Next, process the standard ABC files into **Interleaved ABC notation** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):

```bash
python batch_interleaved_abc.py
```
- **Input:** `.abc` (Standard ABC)  
- **Output:** `.abc` *(Interleaved ABC for CLaMP 3)*  

##### **1.2 Convert MIDI to MTF Format**
CLaMP 3 processes **performance signals** in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):

```bash
python batch_midi2mtf.py
```
- **Input:** `.mid`, `.midi`  
- **Output:** `.mtf` *(MTF for CLaMP 3)*  

##### **1.3 Extract Audio Features using MERT**
For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract **MERT-based features** from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):

```bash
python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path musichubert_hf/MERT-v1-95M --mean_features
```
- **Input:** `.mp3`, `.wav`  
- **Output:** `.npy` *(Processed audio features for CLaMP 3)*  

### **Training and Feature Extraction**
#### **1. Training Models**
Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters and data paths**.

To train CLaMP 3 on **symbolic music**, use **[train_clamp3_symbolic.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_symbolic.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
```

For **audio data**, use **[train_clamp3_audio.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_audio.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
```

Alternatively, you can use **pre-trained weights**:
- **[CLaMP 3 SAAS (Optimal for Audio)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas.pth)**
- **[CLaMP 3 C2 (Optimal for Symbolic Music)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2.pth)**  

By default, CLaMP 3 is configured for the **SAAS version**, which provides **optimal performance on audio data**. If working primarily with **symbolic music**, download the **C2 variant** and modify **line 66 in `config.py`** from **saas** to **c2**.

#### **2. Feature Extraction**
After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):

```bash
accelerate launch extract_clamp3.py --epoch <epoch> <input_dir> <output_dir> [--get_global]
```
- **`--epoch <epoch>`:** (Optional) Specify the checkpoint epoch.
- **`<input_dir>`:** Directory containing the input files.
- **`<output_dir>`:** Destination folder for the output `.npy` features.
- **`--get_global`**: (Optional) Flag to extract a global semantic vector for each input.

All extracted features are stored as `.npy` files.

> **Note**: In this project, we use the global semantic vectors (via average pooling and a linear layer) for both classification and retrieval tasks.

### **Retrieval and Classification**
#### **1. Semantic Search**
Retrieve **similar music features** using **[`semantic_search.py`](https://github.com/sanderwood/clamp3/tree/main/retrieval/semantic_search.py)**:  
```bash
python semantic_search.py <query_file> <reference_folder> [--top_k TOP_K]
```
> **Note:** Zero-shot classification is essentially **semantic search**, where the query feature is compared against class prototypes.  

#### **2. Classification**
Train a linear classifier using **[`train_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/train_cls.py)**:  
```bash
python train_cls.py --train_folder <path> --eval_folder <path> [--num_epochs <int>] [--learning_rate <float>] [--balanced_training]
```
Run inference with **[`inference_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/inference_cls.py)**:  
```bash
python inference_cls.py <weights_path> <feature_folder> <output_file>
```

## **Citation**
*Coming Soon...*