sander-wood committed on
Commit f94d6d5 · verified · 1 Parent(s): 1458565

Update README.md

Files changed (1)
  1. README.md +132 -103
README.md CHANGED
@@ -102,129 +102,158 @@ pipeline_tag: feature-extraction
102
  tags:
103
  - music
104
  ---
105
- # CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
106
 
107
  <p align="center">
108
  <img src="overview.png" alt="CLaMP 3 Overview" width="50%">
109
  </p>
110
 
 
 
111
 
112
- ## Overview
113
- CLaMP 3 is a unified framework for cross-modal and cross-lingual music information retrieval (MIR). By using contrastive learning, it aligns sheet music, audio, performance signals, and multilingual text into a shared representation space, enabling retrieval across unaligned musical modalities. Key features include:
 
 
 
114
 
115
- - **Multimodal Support:**
116
- 1. **Sheet Music:** Uses Interleaved ABC notation.
117
- 2. **Performance Signals:** Processes MIDI Text Format (MTF) data.
118
- 3. **Audio Recordings:** Works with audio features extracted by [MERT](https://arxiv.org/abs/2306.00107).
119
 
120
- - **Multilingual Capabilities:** Trained on **27 languages** and generalizes to all **100 languages** supported by [XLM-R](https://arxiv.org/abs/1911.02116).
 
 
121
 
122
- - **Dataset and Benchmark:**
123
- - Trained on **M4-RAG**, a large-scale dataset of 2.31M high-quality music-text pairs across 27 languages and 194 countries.
124
- - Introduces **WikiMT-X**, a benchmark containing 1,000 triplets of sheet music, audio, and text.
 
 
 
 
125
 
126
- CLaMP 3 supports a wide range of applications in MIR and music research, including but not limited to:
127
- - **Semantic retrieval:** Searching for music based on descriptive text or retrieving textual metadata based on audio or symbolic representations.
128
- - **Zero-shot classification:** Categorizing music by genre, region, or other attributes without labeled training data.
129
- - **Music quality assessment:** Measuring the **semantic distance** between reference ground truth and generated music using metrics analogous to **Fréchet Inception Distance (FID)**, providing an objective alternative to human evaluation.
130
- - **Evaluation of generative models:** Assessing **text-to-music generation**, **music captioning**, and **symbolic-to-audio synthesis** models by quantifying their alignment across different modalities.
131
- - **Computational musicology:** Enabling studies in **geographical musicology**, analyzing regional distributions and cross-cultural influences through large-scale multimodal datasets.
132
 
133
- Importantly, these applications are **not restricted to any specific musical modality or language**. CLaMP 3's multimodal and multilingual design allows it to generalize across diverse datasets, making it a powerful tool for a wide range of music-related AI research.
 
 
 
 
 
 
134
 
135
- ### Links
136
- - CLaMP 3 Demo Page (Coming Soon...)
137
- - CLaMP 3 Paper (Coming Soon...)
138
- - [CLaMP 3 Code](https://github.com/sanderwood/clamp3)
139
- - [CLaMP 3 Model Weights](https://huggingface.co/sander-wood/clamp3/tree/main)
140
- - [M4-RAG Pre-training Dataset](https://huggingface.co/datasets/sander-wood/m4-rag)
141
- - [WikiMT-X Evaluation Benchmark](https://huggingface.co/datasets/sander-wood/wikimt-x)
142
 
143
- > **Note** Ensure the model weights for CLaMP 3 are placed under the `code/` folder for proper loading. Also, verify that the configuration hyperparameters are correctly set.
 
 
 
 
144
 
145
- ## Repository Structure
146
- - [code/](https://github.com/sanderwood/clamp3/tree/main/code): Contains scripts for training CLaMP 3 and extracting features from music and text data. You can modify hyperparameters and file paths in the configuration files for custom training.
147
-
148
- - [classification/](https://github.com/sanderwood/clamp3/tree/main/classification): Includes scripts for classification tasks using extracted features, such as training linear classification models and making predictions.
 
 
 
149
 
150
- - [preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing): Scripts for converting your data into compatible formats (interleaved ABC notation, MTF, or MERT-extracted audio features). These are required for CLaMP 3 to work with the data.
 
 
151
 
152
- - [retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval): Provides scripts for evaluating model performance, conducting semantic searches, and calculating similarity metrics based on extracted feature vectors.
153
 
154
- > **Note** For detailed instructions on how to use the scripts in each folder, please refer to the individual README files within those directories. This main README provides only a high-level overview of the repository.
155
 
156
- ## Getting Started
157
 
158
- ### Environment Setup
159
- To set up the environment for CLaMP 3, run the following commands:
 
 
 
 
 
160
 
161
  ```bash
162
- conda env create -f environment.yml
163
- conda activate clamp3
164
  ```
165
 
166
- ### Data Preparation
167
- 1. **Convert Files**: Navigate to the `preprocessing/` folder and convert your music files into a compatible format (interleaved ABC notation, MTF, or MERT-extracted audio features) suitable for use with CLaMP 3. Whether you are training or performing inference, **you must use these preprocessing scripts to ensure the data is in the correct format**.
168
- 1. **Interleaved ABC Notation**:
169
- - Convert MusicXML files to ABC using [batch_xml2abc.py](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py).
170
- - Process the ABC files into interleaved notation using [batch_interleaved_abc.py](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py).
171
- 2. **MTF**:
172
- - Convert MIDI files to MTF format using [batch_midi2mtf.py](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py).
173
- 3. **MERT-extracted Audio Features**:
174
- - Extract audio features using MERT by running the scripts [extract_mert.py](https://github.com/sanderwood/clamp3/tree/main/preprocessing/audio/extract_mert.py). These features will be saved as `.npy` files and are ready for use in CLaMP 3.
175
-
176
- 2. **Prepare Text Metadata (Optional)**: If you plan to train the model, you will need to prepare corresponding metadata for each music file. The metadata should be in JSON format, containing details like title, artist, region, language, and description.
177
-
178
- Example:
179
- ```json
180
- {
181
- "filepaths": ["audio/--/---aL9TdeI4.npy"],
182
- "id": "---aL9TdeI4",
183
- "title": "Mairi's Wedding",
184
- "artists": ["Noel McLoughlin"],
185
- "region": "United Kingdom of Great Britain and Northern Ireland",
186
- "language": "English",
187
- "genres": ["Folk", "Traditional"],
188
- "tags": ["Scottish", "Wedding", "Traditional", "Folk", "Celtic"],
189
- "background": "Mairi's Wedding is a Scottish folk song...",
190
- "analysis": "The song has a lively and upbeat Scottish folk rhythm...",
191
- "description": "A traditional folk song with a joyful celebration...",
192
- "scene": "The setting is a picturesque Scottish village on a sunny morning...",
193
- "translations": { "language": "Vietnamese", "background": "Bài hát \"Đám Cưới Mairi\"..." }
194
- }
195
- ```
196
-
197
- Once your JSON files are ready, merge them into a single `.jsonl` file and structure the directories as shown:
198
-
199
- ```
200
- /your-data-folder/
201
- ├── abc/
202
- ├── audio/
203
- ├── mtf/
204
- ├── merged_output.jsonl
205
- ```
206
-
207
- ### Training and Feature Extraction
208
- 2. **Training Models**: If you want to train CLaMP 3, check the training scripts in the [code/](https://github.com/sanderwood/clamp3/tree/main/code) folder and modify the [config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py) file to set your hyperparameters and data paths.
209
-
210
- 3. **Extracting Features**: After training (or if you have pre-trained weights), extract features from **preprocessed** data using [extract_clamp3.py](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py). The script automatically detects the modality based on the file extension (e.g., `.txt`, `.abc`, `.mtf`, `.npy`). Make sure your data has already been converted into CLaMP 3–compatible formats by following the scripts in the [preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing) folder.
211
-
212
- ### Classification and Retrieval
213
- 4. **Classification**: To perform classification on the extracted features, navigate to the [classification/](https://github.com/sanderwood/clamp3/tree/main/classification) directory. You’ll find scripts for training and making predictions using linear classification models.
214
-
215
- 5. **Semantic Search**: To conduct semantic searches using the extracted features, refer to the scripts in the [retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval) folder.
216
-
217
- ## Citation
218
- Coming Soon...
219
- <!-- If you use CLaMP 3, M4-RAG, or WikiMT-X in your research, please cite the following paper:
220
-
221
- bibtex
222
- @misc{wu2024clamp2multimodalmusic,
223
- title={CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models},
224
- author={Shangda Wu and Yashan Wang and Ruibin Yuan and Zhancheng Guo and Xu Tan and Ge Zhang and Monan Zhou and Jing Chen and Xuefeng Mu and Yuejie Gao and Yuanliang Dong and Jiafeng Liu and Xiaobing Li and Feng Yu and Maosong Sun},
225
- year={2024},
226
- eprint={2410.13267},
227
- archivePrefix={arXiv},
228
- primaryClass={cs.SD},
229
- url={https://arxiv.org/abs/2410.13267},
230
- } -->
 
102
  tags:
103
  - music
104
  ---
105
+ # **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**
106
 
107
  <p align="center">
108
  <img src="overview.png" alt="CLaMP 3 Overview" width="50%">
109
  </p>
110
 
111
+ ## **Overview**
112
+ CLaMP 3 is a **multimodal and multilingual** framework for **music information retrieval (MIR)**. By using **contrastive learning**, it aligns **sheet music, audio, performance signals, and multilingual text** into a **shared representation space**, enabling retrieval across unaligned musical modalities.
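To make the idea of contrastive alignment concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired music and text embeddings. It is not taken from the CLaMP 3 training code; the function names and the temperature value of 0.07 are assumptions chosen for illustration, and the actual objective used by the project may differ.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(music: np.ndarray, text: np.ndarray,
                               temperature: float = 0.07) -> float:
    """Hypothetical helper: row i of `music` is paired with row i of `text`.

    The temperature is an assumed value, not one stated in the README.
    """
    music, text = l2_normalize(music), l2_normalize(text)
    logits = music @ text.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Music-to-text: each music item should pick its own text among all texts.
    loss_m2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-music: each text should pick its own music item among all music.
    loss_t2m = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_m2t + loss_t2m))
```

Minimizing a loss of this kind pulls matching pairs together and pushes mismatched pairs apart, which is what places different modalities in one shared space.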
113
 
114
+ ### **Key Features**
115
+ - **Multimodal Support:**
116
+ - **Sheet Music:** Uses **Interleaved ABC notation**, with a context size of **512 bars**.
117
+ - **Performance Signals:** Processes **MIDI Text Format (MTF)** data, with a context size of **512 MIDI messages**.
118
+ - **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.
119
 
120
+ - **Multilingual Capabilities:**
121
+ - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.
 
 
122
 
123
+ - **Datasets & Benchmarking:**
124
+ - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A **large-scale** dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
125
+ - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.
126
 
127
+ ### **Applications**
128
+ CLaMP 3 supports a **wide range of music research tasks**, including but not limited to:
129
+ - **Semantic Retrieval:** Find music based on **descriptions** or retrieve textual metadata for **audio or symbolic** inputs.
130
+ - **Zero-Shot Classification:** Categorize **music by genre, region, or other attributes** without labeled training data.
131
+ - **Music Quality Assessment:** Compute the **semantic distance** between reference and generated music features, similar to **Fréchet Inception Distance (FID)**.
132
+ - **Cross-Modal Generative Model Evaluation:** Assess **text-to-music generation, music captioning**, and **symbolic-to-audio synthesis** models.
133
+ - **Computational Musicology:** By visualizing the distribution of data within the **shared representation space**, researchers can explore regional music patterns, stylistic similarities, and cross-cultural influences.
134
 
135
+ Importantly, these applications are **not restricted to any specific music modality or language**, making CLaMP 3 a powerful tool for **diverse music AI research**.
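The FID-like comparison mentioned under Music Quality Assessment can be sketched as a Fréchet distance between two sets of feature vectors, each modeled as a Gaussian. This is a generic illustration, not the repository's evaluation script: `frechet_distance` is a hypothetical helper, and it assumes each input is an array of shape `(n_items, dim)` with `dim > 1` and enough items to estimate a covariance matrix.

```python
import numpy as np

def frechet_distance(ref_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Hypothetical helper; inputs are assumed to be (n_items, dim) arrays.
    """
    mu_r, mu_g = ref_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(ref_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_r @ cov_g, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

A lower value means the generated features are distributed more like the reference features; identical sets give a distance of approximately zero.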
 
 
 
 
 
136
 
137
+ ## **Links**
138
+ - **CLaMP 3 Demo Page** *(Coming Soon...)*
139
+ - **CLaMP 3 Paper** *(Coming Soon...)*
140
+ - **[CLaMP 3 Code](https://github.com/sanderwood/clamp3)**
141
+ - **[CLaMP 3 Model Weights](https://huggingface.co/sander-wood/clamp3/tree/main)**
142
+ - **[M4-RAG Pre-training Dataset](https://huggingface.co/datasets/sander-wood/m4-rag)**
143
+ - **[WikiMT-X Evaluation Benchmark](https://huggingface.co/datasets/sander-wood/wikimt-x)**
144
 
145
+ > **Note:** Ensure the model weights are placed in the `code/` folder, and verify the **configuration hyperparameters** before use.
 
 
 
 
 
 
146
 
147
+ ## **Repository Structure**
148
+ - **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
149
+ - **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.
150
+ - **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Convert data into **Interleaved ABC, MTF, or MERT-extracted features**.
151
+ - **[retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval)** → Semantic search, retrieval evaluation, and similarity calculations.
152
 
153
+ ## **Getting Started**
154
+ ### **Environment Setup**
155
+ To set up the environment for CLaMP 3, run:
156
+ ```bash
157
+ conda env create -f environment.yml
158
+ conda activate clamp3
159
+ ```
160
 
161
+ ### **Data Preparation**
162
+ #### **1. Convert Music Data to Compatible Formats**
163
+ Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.
164
 
165
+ > **Note:** Each script requires a manual edit of the `input_dir` variable at the top of the file before running, **except for the MERT extraction script (`extract_mert.py`), which takes command-line arguments for input and output paths.**
166
 
167
+ ##### **1.1 Convert MusicXML to Interleaved ABC Notation**
168
 
169
+ CLaMP 3 requires **Interleaved ABC notation** for sheet music. First, convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):
170
 
171
+ ```bash
172
+ python batch_xml2abc.py
173
+ ```
174
+ - **Input:** `.mxl`, `.xml`, `.musicxml`
175
+ - **Output:** `.abc` (Standard ABC)
176
+
177
+ Next, process the standard ABC files into **Interleaved ABC notation** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):
178
 
179
  ```bash
180
+ python batch_interleaved_abc.py
181
+ ```
182
+ - **Input:** `.abc` (Standard ABC)
183
+ - **Output:** `.abc` *(Interleaved ABC for CLaMP 3)*
184
+
185
+ ##### **1.2 Convert MIDI to MTF Format**
186
+ CLaMP 3 processes **performance signals** in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):
187
+
188
+ ```bash
189
+ python batch_midi2mtf.py
190
+ ```
191
+ - **Input:** `.mid`, `.midi`
192
+ - **Output:** `.mtf` *(MTF for CLaMP 3)*
193
+
194
+ ##### **1.3 Extract Audio Features using MERT**
195
+ For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract **MERT-based features** from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):
196
+
197
+ ```bash
198
+ python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path musichubert_hf/MERT-v1-95M --mean_features
199
+ ```
200
+ - **Input:** `.mp3`, `.wav`
201
+ - **Output:** `.npy` *(Processed audio features for CLaMP 3)*
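The three conversion routes above can be summarized as a lookup from file extension to preprocessing script. The sketch below only plans which script applies to which file; it does not run anything. `plan_preprocessing` is a hypothetical helper, and the mapping is taken from the Input lists in sections 1.1 to 1.3. Note that MusicXML files need a second pass through `batch_interleaved_abc.py` after `batch_xml2abc.py`.

```python
from pathlib import Path

# Extension -> first preprocessing script, as listed in sections 1.1-1.3.
PREPROCESSORS = {
    ".mxl": "batch_xml2abc.py",
    ".xml": "batch_xml2abc.py",
    ".musicxml": "batch_xml2abc.py",
    ".mid": "batch_midi2mtf.py",
    ".midi": "batch_midi2mtf.py",
    ".mp3": "extract_mert.py",
    ".wav": "extract_mert.py",
}

def plan_preprocessing(filenames):
    """Group file names by the script that should process them.

    Hypothetical helper: files with unrecognized extensions are skipped.
    """
    plan = {}
    for name in filenames:
        script = PREPROCESSORS.get(Path(name).suffix.lower())
        if script is not None:
            plan.setdefault(script, []).append(name)
    return plan
```

Grouping files this way makes it easier to place each group in its own directory before setting `input_dir` in the corresponding script.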
202
+
203
+ ### **Training and Feature Extraction**
204
+ #### **1. Training Models**
205
+ Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters and data paths**.
206
+
207
+ To train CLaMP 3 on **symbolic music**, use **[train_clamp3_symbolic.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_symbolic.py)**:
208
+
209
+ ```bash
210
+ python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
211
+ ```
212
+
213
+ For **audio data**, use **[train_clamp3_audio.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_audio.py)**:
214
+
215
+ ```bash
216
+ python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
217
+ ```
218
+
219
+ Alternatively, you can use **pre-trained weights**:
220
+ - **[CLaMP 3 SAAS (Optimal for Audio)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas.pth)**
221
+ - **[CLaMP 3 C2 (Optimal for Symbolic Music)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2.pth)**
222
+
223
+ By default, CLaMP 3 is configured for the **SAAS version**, which provides **optimal performance on audio data**. If working primarily with **symbolic music**, download the **C2 variant** and modify **line 66 in `config.py`** from **saas** to **c2**.
224
+
225
+ #### **2. Feature Extraction**
226
+ After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):
227
+
228
+ ```bash
229
+ accelerate launch extract_clamp3.py [--epoch <epoch>] <input_dir> <output_dir> [--get_global]
230
+ ```
231
+ - **`--epoch <epoch>`:** (Optional) Specify the checkpoint epoch.
232
+ - **`<input_dir>`:** Directory containing the input files.
233
+ - **`<output_dir>`:** Destination folder for the output `.npy` features.
234
+ - **`--get_global`**: (Optional) Flag to extract a global semantic vector for each input.
235
+
236
+ All extracted features are stored as `.npy` files.
237
+
238
+ > **Note**: In this project, we use the global semantic vectors (via average pooling and a linear layer) for both classification and retrieval tasks.
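For downstream use, the extracted `.npy` files can be collected into a single matrix. The following is a minimal sketch under stated assumptions: `load_feature_folder` is a hypothetical helper, each file is assumed to hold one global vector (as produced with `--get_global`), possibly with a leading singleton dimension, and all vectors are assumed to have the same length. The exact array shape written by `extract_clamp3.py` is not stated here, so check it on your own output first.

```python
from pathlib import Path
import numpy as np

def load_feature_folder(folder):
    """Load every .npy file in `folder` into one (n_files, dim) matrix.

    Hypothetical helper; assumes one global vector per file.
    """
    paths = sorted(Path(folder).glob("*.npy"))
    if not paths:
        raise ValueError(f"no .npy files found in {folder}")
    names = [p.stem for p in paths]
    # Flatten each array so shapes like (1, dim) and (dim,) are treated alike.
    feats = np.stack([np.load(p).reshape(-1) for p in paths])
    return names, feats
```

The returned `names` list keeps the file stems in the same order as the rows of the matrix, so results can be mapped back to the source files.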
239
+
240
+ ### **Retrieval and Classification**
241
+ #### **1. Semantic Search**
242
+ Retrieve **similar music features** using **[`semantic_search.py`](https://github.com/sanderwood/clamp3/tree/main/retrieval/semantic_search.py)**:
243
+ ```bash
244
+ python semantic_search.py <query_file> <reference_folder> [--top_k TOP_K]
245
+ ```
246
+ > **Note:** Zero-shot classification is essentially **semantic search**, where the query feature is compared against class prototypes.
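The note above can be illustrated with a small ranking routine: a query feature is scored against a set of reference features, and zero-shot classification picks the label of the best-scoring class prototype. This is a sketch, not the logic of `semantic_search.py`; the use of cosine similarity, the helper names, and the idea that prototypes are features of text prompts describing each class are all assumptions.

```python
import numpy as np

def rank_by_cosine(query: np.ndarray, references: np.ndarray, top_k: int = 3):
    """Return (index, score) pairs for the top_k references most similar to query.

    Hypothetical helper; cosine similarity is an assumed choice of metric.
    """
    query = query / np.linalg.norm(query)
    refs = references / np.linalg.norm(references, axis=1, keepdims=True)
    scores = refs @ query
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

def zero_shot_label(query: np.ndarray, prototypes: np.ndarray, labels):
    """Pick the label whose prototype is closest to the query feature."""
    best_index, _ = rank_by_cosine(query, prototypes, top_k=1)[0]
    return labels[best_index]
```

With prototypes stacked row by row in the same order as `labels`, the same ranking function serves both retrieval and classification.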
247
+
248
+ #### **2. Classification**
249
+ Train a linear classifier using **[`train_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/train_cls.py)**:
250
+ ```bash
251
+ python train_cls.py --train_folder <path> --eval_folder <path> [--num_epochs <int>] [--learning_rate <float>] [--balanced_training]
252
+ ```
253
+ Run inference with **[`inference_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/inference_cls.py)**:
254
+ ```bash
255
+ python inference_cls.py <weights_path> <feature_folder> <output_file>
256
  ```
257
 
258
+ ## **Citation**
259
+ *Coming Soon...*