---
license: mit
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
- yue
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: feature-extraction
tags:
- music
---
|
# **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**

<p align="center">
  <img src="overview.png" alt="CLaMP 3 Overview" width="50%">
</p>
|
|
|
## **Overview**

CLaMP 3 is a **multimodal and multilingual** framework for **music information retrieval (MIR)**. Using **contrastive learning**, it aligns **sheet music, audio, performance signals, and multilingual text** into a **shared representation space**, enabling retrieval across unaligned musical modalities.
|
|
|
### **Key Features**

- **Multimodal Support:**
  - **Sheet Music:** Uses **Interleaved ABC notation**, with a context size of **512 bars**.
  - **Performance Signals:** Processes **MIDI Text Format (MTF)** data, with a context size of **512 MIDI messages**.
  - **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.

- **Multilingual Capabilities:**
  - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.

- **Datasets & Benchmarking:**
  - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A **large-scale** dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
  - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.
|
|
|
### **Applications**

CLaMP 3 supports a **wide range of music research tasks**, including but not limited to:

- **Semantic Retrieval:** Find music based on **descriptions** or retrieve textual metadata for **audio or symbolic** inputs.
- **Zero-Shot Classification:** Categorize **music by genre, region, or other attributes** without labeled training data.
- **Music Quality Assessment:** Compute the **semantic distance** between reference and generated music features, similar to **Fréchet Inception Distance (FID)**.
- **Cross-Modal Generative Model Evaluation:** Assess **text-to-music generation, music captioning**, and **symbolic-to-audio synthesis** models.
- **Computational Musicology:** By visualizing the distribution of data within the **shared representation space**, researchers can explore regional music patterns, stylistic similarities, and cross-cultural influences.

Importantly, these applications are **not restricted to any specific music modality or language**, making CLaMP 3 a powerful tool for **diverse music AI research**.
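As a concrete illustration of the FID-style quality assessment above, the Fréchet distance between two sets of extracted feature vectors can be computed by fitting a Gaussian to each set. This is a minimal NumPy/SciPy sketch; the function name is illustrative and not part of the CLaMP 3 codebase.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_ref, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets.

    feats_ref, feats_gen: arrays of shape (n_samples, dim), e.g. stacked
    global semantic vectors loaded from the extracted .npy files.
    """
    mu1, mu2 = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_ref, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

A distance near zero indicates that the generated features are distributed like the reference features; larger values indicate a larger semantic gap.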
|
|
|
## **Links**

- **CLaMP 3 Demo Page** *(Coming Soon...)*
- **CLaMP 3 Paper** *(Coming Soon...)*
- **[CLaMP 3 Code](https://github.com/sanderwood/clamp3)**
- **[CLaMP 3 Model Weights](https://huggingface.co/sander-wood/clamp3/tree/main)**
- **[M4-RAG Pre-training Dataset](https://huggingface.co/datasets/sander-wood/m4-rag)**
- **[WikiMT-X Evaluation Benchmark](https://huggingface.co/datasets/sander-wood/wikimt-x)**

> **Note:** Ensure the model weights are placed in the `code/` folder, and verify the **configuration hyperparameters** before use.
|
|
|
## **Repository Structure**

- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classifier training and prediction.
- **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Scripts to convert data into **Interleaved ABC, MTF, or MERT-extracted features**.
- **[retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval)** → Semantic search, retrieval evaluation, and similarity calculations.
|
|
|
## **Getting Started**

### **Environment Setup**

To set up the environment for CLaMP 3, run:

```bash
conda env create -f environment.yml
conda activate clamp3
```
|
|
|
### **Data Preparation**

#### **1. Convert Music Data to Compatible Formats**

Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.

> **Note:** Each script requires a manual edit of the `input_dir` variable at the top of the file before running, **except for the MERT extraction script (`extract_mert.py`), which takes command-line arguments for input and output paths.**
|
|
|
##### **1.1 Convert MusicXML to Interleaved ABC Notation**

CLaMP 3 requires **Interleaved ABC notation** for sheet music. First, convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):

```bash
python batch_xml2abc.py
```

- **Input:** `.mxl`, `.xml`, `.musicxml`
- **Output:** `.abc` *(Standard ABC)*

Next, convert the standard ABC files into **Interleaved ABC notation** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):

```bash
python batch_interleaved_abc.py
```

- **Input:** `.abc` *(Standard ABC)*
- **Output:** `.abc` *(Interleaved ABC for CLaMP 3)*
|
|
|
##### **1.2 Convert MIDI to MTF Format**

CLaMP 3 processes **performance signals** in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):

```bash
python batch_midi2mtf.py
```

- **Input:** `.mid`, `.midi`
- **Output:** `.mtf` *(MTF for CLaMP 3)*
|
|
|
##### **1.3 Extract Audio Features using MERT**

For audio, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract them from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):

```bash
python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path musichubert_hf/MERT-v1-95M --mean_features
```

- **Input:** `.mp3`, `.wav`
- **Output:** `.npy` *(Processed audio features for CLaMP 3)*
|
|
|
### **Training and Feature Extraction**

#### **1. Training Models**

Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters and data paths**.

To train CLaMP 3 on **symbolic music**, use **[train_clamp3_symbolic.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_symbolic.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
```

For **audio data**, use **[train_clamp3_audio.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_audio.py)**:

```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
```

Alternatively, you can use **pre-trained weights**:

- **[CLaMP 3 SAAS (Optimal for Audio)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas.pth)**
- **[CLaMP 3 C2 (Optimal for Symbolic Music)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2.pth)**

By default, CLaMP 3 is configured for the **SAAS version**, which provides **optimal performance on audio data**. If you work primarily with **symbolic music**, download the **C2 variant** and change **line 66 of `config.py`** from **`saas`** to **`c2`**.
|
|
|
#### **2. Feature Extraction**

After training (or when using pre-trained weights), extract features with [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):

```bash
accelerate launch extract_clamp3.py --epoch <epoch> <input_dir> <output_dir> [--get_global]
```

- **`--epoch <epoch>`:** (Optional) Checkpoint epoch to load.
- **`<input_dir>`:** Directory containing the input files.
- **`<output_dir>`:** Destination folder for the output `.npy` features.
- **`--get_global`:** (Optional) Extract a global semantic vector for each input.

All extracted features are stored as `.npy` files.

> **Note:** In this project, the global semantic vectors (obtained via average pooling and a linear layer) are used for both classification and retrieval tasks.
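Once extracted, the `.npy` feature files can be compared directly with cosine similarity. The sketch below is illustrative only (the `cosine_rank` helper and flat directory layout are assumptions, not part of the repository); `semantic_search.py` is the canonical retrieval tool.

```python
import numpy as np
from pathlib import Path


def cosine_rank(query_path, reference_dir, top_k=5):
    """Rank reference .npy feature files by cosine similarity to a query.

    query_path: path to one extracted .npy feature file.
    reference_dir: directory containing reference .npy feature files.
    """
    query = np.load(query_path).squeeze()
    query = query / np.linalg.norm(query)
    scores = {}
    for ref in Path(reference_dir).glob("*.npy"):
        feat = np.load(ref).squeeze()
        scores[ref.name] = float(query @ (feat / np.linalg.norm(feat)))
    # Highest-similarity references first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```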
|
|
|
### **Retrieval and Classification**

#### **1. Semantic Search**

Retrieve **similar music features** with **[`semantic_search.py`](https://github.com/sanderwood/clamp3/tree/main/retrieval/semantic_search.py)**:

```bash
python semantic_search.py <query_file> <reference_folder> [--top_k TOP_K]
```

> **Note:** Zero-shot classification is essentially **semantic search**, where the query feature is compared against class prototypes.
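The prototype comparison described in the note can be sketched in a few lines. This is a hedged illustration: `zero_shot_classify` is a hypothetical helper, and the prototype vectors are assumed to be features extracted from class-name text prompts.

```python
import numpy as np


def zero_shot_classify(feature, prototypes):
    """Assign the label whose prototype vector is most cosine-similar.

    feature: extracted feature vector for one piece of music.
    prototypes: dict mapping label -> prototype vector (e.g. text features
    extracted from class-name prompts; names here are illustrative).
    """
    feature = feature / np.linalg.norm(feature)
    best_label, _ = max(
        prototypes.items(),
        key=lambda kv: float(feature @ (kv[1] / np.linalg.norm(kv[1]))),
    )
    return best_label
```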
|
|
|
#### **2. Classification**

Train a linear classifier with **[`train_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/train_cls.py)**:

```bash
python train_cls.py --train_folder <path> --eval_folder <path> [--num_epochs <int>] [--learning_rate <float>] [--balanced_training]
```

Run inference with **[`inference_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/inference_cls.py)**:

```bash
python inference_cls.py <weights_path> <feature_folder> <output_file>
```
|
|
|
## **Citation**

*Coming Soon...*