---
license: cc-by-nc-4.0
tags:
- audio
- music
- merge
metrics:
- roc_auc
- average_precision
- f1
model_type: audio-classification
pipeline_tag: audio-classification
library_name: transformers
base_model:
- m-a-p/MERT-v1-95M
---
|
|
|
# CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning |
|
[**Read the full paper (to be presented at ISMIR 2025)**](https://arxiv.org/abs/2506.17818)
|
|
|
**CultureMERT-TA-95M** is a 95M-parameter music foundation model adapted to diverse musical cultures through [**task arithmetic**](https://arxiv.org/abs/2212.04089). Instead of direct continual pre-training on a multi-cultural mixture, as in [CultureMERT-95M](https://huggingface.co/ntua-slp/CultureMERT-95M), this model merges multiple **single-culture adapted** variants of [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M), each continually pre-trained via our two-stage strategy on a distinct musical tradition:
|
|
|
|
|
| Dataset | Music Tradition | Hours Used |
|-----------------|-----------------------------|------------|
| [*Lyra*](https://github.com/pxaris/lyra-dataset) | Greek traditional/folk | 50h |
| [*Turkish-makam*](https://dunya.compmusic.upf.edu/makam/) | Turkish/Ottoman classical | 200h |
| [*Hindustani*](https://dunya.compmusic.upf.edu/hindustani/) | North Indian classical | 200h |
| [*Carnatic*](https://dunya.compmusic.upf.edu/carnatic/) | South Indian classical | 200h |
|
|
|
|
|
> The final model was merged using a scaling factor of **λ = 0.2**, which yielded the best overall performance across all task arithmetic variants evaluated.
|
|
|
|
|
This model serves as an alternative to [**CultureMERT-95M**](https://huggingface.co/ntua-slp/CultureMERT-95M). It merges culturally specialized models in weight space via task arithmetic to form a unified multi-cultural model. Each single-culture adapted model is obtained using the same two-stage continual pre-training strategy as CultureMERT-95M, applied separately to each musical tradition prior to merging.
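
For readers who want to see the merging step mechanically, below is a minimal sketch of weight-space merging via task arithmetic under the rule θ_merged = θ_base + λ · Σ_t (θ_t − θ_base) with λ = 0.2. The checkpoint paths are hypothetical placeholders (the single-culture adapted models are not published under these names), and this is an illustration of the technique, not the exact script used to produce this release.

```python
import torch
from transformers import AutoModel

# Hypothetical local paths to the four single-culture adapted checkpoints
# (placeholders for illustration only).
BASE_ID = "m-a-p/MERT-v1-95M"
CULTURE_CHECKPOINTS = [
    "path/to/mert-adapted-lyra",           # Greek traditional/folk
    "path/to/mert-adapted-turkish-makam",  # Turkish/Ottoman classical
    "path/to/mert-adapted-hindustani",     # North Indian classical
    "path/to/mert-adapted-carnatic",       # South Indian classical
]
LAMBDA = 0.2  # scaling factor reported above

base = AutoModel.from_pretrained(BASE_ID, trust_remote_code=True)
base_state = base.state_dict()

# Accumulate the task vectors (theta_t - theta_base) over all adapted models;
# all checkpoints are assumed to share the base architecture and parameter names.
delta_sum = {k: torch.zeros_like(v) for k, v in base_state.items() if v.is_floating_point()}
for ckpt in CULTURE_CHECKPOINTS:
    adapted_state = AutoModel.from_pretrained(ckpt, trust_remote_code=True).state_dict()
    for k in delta_sum:
        delta_sum[k] += adapted_state[k] - base_state[k]

# theta_merged = theta_base + lambda * sum_t (theta_t - theta_base)
merged_state = dict(base_state)
for k, delta in delta_sum.items():
    merged_state[k] = base_state[k] + LAMBDA * delta

base.load_state_dict(merged_state)
base.save_pretrained("merged-culture-mert")  # local output directory
```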
|
|
|
--- |
|
|
|
# Evaluation
|
|
|
We follow the same evaluation protocol as [CultureMERT-95M](https://huggingface.co/ntua-slp/CultureMERT-95M) and report results in comparison to both it and [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M): |
|
|
|
|
|
## ROC-AUC / mAP |
|
|
|
| Model | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | **Avg.** |
|--------------------|:-------------:|:----------:|:--------:|:----:|:----------:|:----:|:--------:|
| **MERT-v1-95M** | 83.2% / 53.3% | 82.4% / 52.9% | 74.9% / 39.7% | 85.7% / 56.5% | 90.7% / 48.1% | 89.6% / 35.9% | 66.1% |
| **CultureMERT-95M** | **89.6%** / 60.6% | **88.2%** / **63.5%** | **79.2%** / 43.1% | 86.9% / 56.7% | 90.7% / 48.1% | 89.4% / 35.9% | **69.3%** |
| **CultureMERT-TA-95M** | 89.0% / **61.0%** | 87.5% / 59.3% | 79.1% / **43.3%** | **87.3%** / **57.3%** | **90.8%** / **49.1%** | 89.6% / **36.4%** | 69.1% |
|
|
|
|
|
## Micro-F1 / Macro-F1 |
|
|
|
| Model | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | **Avg.** |
|--------------------|:-------------:|:----------:|:--------:|:----:|:----------:|:----:|:--------:|
| **MERT-v1-95M** | 73.0% / 38.9% | 71.1% / 33.2% | 80.1% / 30.0% | 72.4% / 42.6% | 57.0% / 36.9% | 35.7% / 21.2% | 49.3% |
| **CultureMERT-95M** | **77.4%** / **45.8%** | **77.8%** / **50.4%** | **82.7%** / **32.5%** | **73.1%** / 43.1% | 58.3% / 36.6% | 35.6% / **22.9%** | **52.9%** |
| **CultureMERT-TA-95M** | 76.9% / 45.4% | 74.2% / 45.0% | 82.5% / 32.1% | 73.0% / **45.3%** | **59.1%** / **38.2%** | 35.7% / 21.5% | 52.4% |
|
|
|
|
|
**CultureMERT-TA-95M** performs comparably to **CultureMERT-95M** on non-Western datasets, while surpassing it on *Lyra* and Western benchmarks. It also outperforms **MERT-v1-95M** on Western tasks (MTAT and FMA-medium) by an average margin of **+0.7%** across all metrics.
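
For reference, the metrics reported above (ROC-AUC, mAP, micro-/macro-F1) can be computed with scikit-learn once frozen embeddings and binary tag matrices have been extracted for a downstream dataset. The sketch below is illustrative only: the file names and the logistic-regression probe are placeholders, not the probing setup used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

# Hypothetical arrays: X_* are frozen CultureMERT embeddings (n_samples, 768),
# Y_* are binary multi-label tag matrices (n_samples, n_tags).
X_train, Y_train = np.load("embeddings_train.npy"), np.load("labels_train.npy")
X_test, Y_test = np.load("embeddings_test.npy"), np.load("labels_test.npy")

# Shallow probe trained on top of the frozen features (one classifier per tag)
probe = MultiOutputClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, Y_train)

# Per-tag positive-class scores for ROC-AUC / mAP, thresholded at 0.5 for F1
scores = np.stack([est.predict_proba(X_test)[:, 1] for est in probe.estimators_], axis=1)
preds = (scores >= 0.5).astype(int)

print("ROC-AUC :", roc_auc_score(Y_test, scores, average="macro"))
print("mAP     :", average_precision_score(Y_test, scores, average="macro"))
print("Micro-F1:", f1_score(Y_test, preds, average="micro"))
print("Macro-F1:", f1_score(Y_test, preds, average="macro"))
```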
|
|
|
--- |
|
|
|
# Model Usage
|
|
|
```python
from transformers import Wav2Vec2FeatureExtractor, AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load model weights and preprocessor config
model = AutoModel.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)

# Load example audio
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation", trust_remote_code=True).sort("id")
audio_array = dataset[0]["audio"]["array"]
sampling_rate = dataset.features["audio"].sampling_rate

# Resample if needed
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f'Setting sample rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# Audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).to(dtype=resampler.kernel.dtype))

# Extract hidden states
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Representations: 13 layers (CNN feature extractor output + 12 Transformer layers)
# NOTE: each layer performs differently on different downstream tasks; choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layers, time steps, 768 feature_dim]

# For utterance-level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# You can even use a learnable weighted average representation over all layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
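
To use the aggregated embedding for a downstream tagging task, one simple option (not included in this repository) is to train a small head on top of it. A minimal sketch, reusing `weighted_avg_hidden_states` from the snippet above and a placeholder `num_tags`:

```python
# Placeholder number of tags for a hypothetical downstream tagging dataset
num_tags = 10

# Small trainable head on top of the 768-dim aggregated representation
classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, num_tags),
)

# Multi-label tag probabilities for the single example above
logits = classifier(weighted_avg_hidden_states.unsqueeze(0))  # [1, num_tags]
probs = torch.sigmoid(logits)
print(probs.shape)  # [1, num_tags]
```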
|
|
|
--- |
|
|
|
# Ethical Considerations |
|
|
|
This model is released under a non-commercial CC BY-NC 4.0 license and is intended for research purposes. While it is designed to address cultural bias in MIR, its training data and pre-training paradigm may still reflect cultural and dataset-specific biases. The model should not be used in commercial or generative applications without explicit consideration of cultural representation, proper attribution, and consent from relevant communities or dataset curators.
|
|
|
|
|
# Citation
|
|
|
```bibtex
@misc{kanatas2025culturemertcontinualpretrainingcrosscultural,
  title={CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning},
  author={Angelos-Nikolaos Kanatas and Charilaos Papaioannou and Alexandros Potamianos},
  year={2025},
  eprint={2506.17818},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2506.17818},
}
```
|
|
|
--- |