CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
Overview
CLaMP 3 is a unified framework for cross-modal and cross-lingual music information retrieval (MIR). By using contrastive learning, it aligns sheet music, audio, performance signals, and multilingual text into a shared representation space, enabling retrieval across unaligned musical modalities. Key features include:
Multimodal Support:
- Sheet Music: Uses Interleaved ABC notation.
- Performance Signals: Processes MIDI Text Format (MTF) data.
- Audio Recordings: Works with audio features extracted by MERT.
Multilingual Capabilities: Trained on 27 languages and generalizes to all 100 languages supported by XLM-R.
Dataset and Benchmark:
- Trained on M4-RAG, a large-scale dataset of 2.31M high-quality music-text pairs across 27 languages and 194 countries.
- Introduces WikiMT-X, a benchmark containing 1,000 triplets of sheet music, audio, and text.
CLaMP 3 supports a wide range of applications in MIR and music research, including but not limited to:
- Semantic retrieval: Searching for music based on descriptive text or retrieving textual metadata based on audio or symbolic representations.
- Zero-shot classification: Categorizing music by genre, region, or other attributes without labeled training data.
- Music quality assessment: Measuring the semantic distance between reference ground truth and generated music using metrics analogous to Fréchet Inception Distance (FID), providing an objective alternative to human evaluation (a minimal sketch appears below).
- Evaluation of generative models: Assessing text-to-music generation, music captioning, and symbolic-to-audio synthesis models by quantifying their alignment across different modalities.
- Computational musicology: Enabling studies in geographical musicology, analyzing regional distributions and cross-cultural influences through large-scale multimodal datasets.
Importantly, these applications are not restricted to any specific musical modality or language. CLaMP 3's multimodal and multilingual design allows it to generalize across diverse datasets, making it a powerful tool for a wide range of music-related AI research.
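As a concrete example of the quality-assessment use case, a Fréchet-style distance can be computed directly from extracted feature vectors. The following is only a minimal sketch, not the repository's evaluation code: it assumes the reference and generated features have already been extracted and stacked into two NumPy arrays.

```python
import numpy as np
from scipy import linalg

def frechet_distance(ref_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    ref_feats, gen_feats: arrays of shape (num_items, feature_dim),
    e.g. stacked CLaMP 3 feature vectors loaded from .npy files.
    """
    mu_r, mu_g = ref_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(ref_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```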
Links
- CLaMP 3 Demo Page (Coming Soon...)
- CLaMP 3 Paper (Coming Soon...)
- CLaMP 3 Code
- CLaMP 3 Model Weights
- M4-RAG Pre-training Dataset
- WikiMT-X Evaluation Benchmark
Note: Ensure the model weights for CLaMP 3 are placed under the `code/` folder for proper loading, and verify that the configuration hyperparameters are correctly set.
Repository Structure
code/: Contains scripts for training CLaMP 3 and extracting features from music and text data. You can modify hyperparameters and file paths in the configuration files for custom training.
classification/: Includes scripts for classification tasks using extracted features, such as training linear classification models and making predictions.
preprocessing/: Scripts for converting your data into compatible formats (interleaved ABC notation, MTF, or MERT-extracted audio features). These are required for CLaMP 3 to work with the data.
retrieval/: Provides scripts for evaluating model performance, conducting semantic searches, and calculating similarity metrics based on extracted feature vectors.
Note: For detailed instructions on how to use the scripts in each folder, please refer to the individual README files within those directories. This main README provides only a high-level overview of the repository.
Getting Started
Environment Setup
To set up the environment for CLaMP 3, run the following commands:
```bash
conda env create -f environment.yml
conda activate clamp3
```
Data Preparation
Convert Files: Navigate to the `preprocessing/` folder and convert your music files into a compatible format (interleaved ABC notation, MTF, or MERT-extracted audio features) suitable for use with CLaMP 3. Whether you are training or performing inference, you must use these preprocessing scripts to ensure the data is in the correct format.
- Interleaved ABC Notation:
  - Convert MusicXML files to ABC using batch_xml2abc.py.
  - Process the ABC files into interleaved notation using batch_interleaved_abc.py.
- MTF:
  - Convert MIDI files to MTF format using batch_midi2mtf.py.
- MERT-extracted Audio Features:
  - Extract audio features using MERT by running extract_mert.py. These features are saved as `.npy` files and are ready for use in CLaMP 3 (a quick sanity check follows this list).
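Before moving on, it can help to confirm that the preprocessed audio features load correctly. This is only a hedged sanity-check sketch; the `audio/` directory below is a hypothetical location for the `.npy` files produced by extract_mert.py.

```python
import numpy as np
from pathlib import Path

# Hypothetical directory holding MERT-extracted feature files (.npy).
for path in sorted(Path("audio").glob("*.npy")):
    feats = np.load(path)
    print(f"{path.name}: shape={feats.shape}, dtype={feats.dtype}")
```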
Prepare Text Metadata (Optional): If you plan to train the model, you will need to prepare corresponding metadata for each music file. The metadata should be in JSON format, containing details like title, artist, region, language, and description.
Example:
```json
{
  "filepaths": ["audio/--/---aL9TdeI4.npy"],
  "id": "---aL9TdeI4",
  "title": "Mairi's Wedding",
  "artists": ["Noel McLoughlin"],
  "region": "United Kingdom of Great Britain and Northern Ireland",
  "language": "English",
  "genres": ["Folk", "Traditional"],
  "tags": ["Scottish", "Wedding", "Traditional", "Folk", "Celtic"],
  "background": "Mairi's Wedding is a Scottish folk song...",
  "analysis": "The song has a lively and upbeat Scottish folk rhythm...",
  "description": "A traditional folk song with a joyful celebration...",
  "scene": "The setting is a picturesque Scottish village on a sunny morning...",
  "translations": {
    "language": "Vietnamese",
    "background": "Bài hát \"Đám Cưới Mairi\"..."
  }
}
```
Once your JSON files are ready, merge them into a single `.jsonl` file and structure the directories as shown:

```
/your-data-folder/
├── abc/
├── audio/
├── mtf/
├── merged_output.jsonl
```
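One simple way to build that merged file is to write each per-file JSON document as a single line of `merged_output.jsonl`. The sketch below is a minimal illustration, assuming a hypothetical `metadata/` folder that holds one JSON file per music item.

```python
import json
from pathlib import Path

data_root = Path("/your-data-folder")     # as in the layout above
metadata_dir = data_root / "metadata"     # hypothetical folder of per-item JSON files

with open(data_root / "merged_output.jsonl", "w", encoding="utf-8") as out:
    for json_path in sorted(metadata_dir.glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            record = json.load(f)
        # One JSON object per line; keep non-ASCII text (e.g. translations) readable.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```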
Training and Feature Extraction
Training Models: If you want to train CLaMP 3, check the training scripts in the code/ folder and modify the config.py file to set your hyperparameters and data paths.
Extracting Features: After training (or if you have pre-trained weights), extract features from preprocessed data using extract_clamp3.py. The script automatically detects the modality based on the file extension (e.g., `.txt`, `.abc`, `.mtf`, `.npy`). Make sure your data has already been converted into CLaMP 3–compatible formats using the scripts in the `preprocessing/` folder.
Classification and Retrieval
Classification: To perform classification on the extracted features, navigate to the classification/ directory. You’ll find scripts for training and making predictions using linear classification models.
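The scripts in `classification/` cover this workflow end to end; purely as an illustration of what a linear probe over frozen CLaMP 3 features looks like, here is a hedged sketch using scikit-learn with hypothetical `features.npy` and `labels.npy` arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: X stacks extracted feature vectors, y holds class labels
# (e.g. genre or region indices) aligned with the rows of X.
X = np.load("features.npy")
y = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)  # simple linear classifier on frozen features
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```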
Semantic Search: To conduct semantic searches using the extracted features, refer to the scripts in the retrieval/ folder.
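Conceptually, semantic search reduces to ranking candidates by the similarity of their feature vectors to a query vector. The repository's retrieval scripts should be preferred in practice; the snippet below is only a sketch of cosine-similarity ranking, with hypothetical `query.npy` and `corpus_features/` paths.

```python
import numpy as np
from pathlib import Path

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-12)

# Hypothetical query feature (e.g. extracted from a text prompt) and a folder of
# candidate features produced by extract_clamp3.py.
query = l2_normalize(np.load("query.npy").reshape(-1))

scores = []
for path in Path("corpus_features").glob("*.npy"):
    cand = l2_normalize(np.load(path).reshape(-1))
    scores.append((float(query @ cand), path.name))

# Print the ten most similar items.
for score, name in sorted(scores, reverse=True)[:10]:
    print(f"{score:.3f}  {name}")
```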
Citation
Coming Soon...