---
license: cc-by-nc-4.0
tags:
- audio
- music
- merge
metrics:
- roc_auc
- average_precision
- f1
model_type: audio-classification
pipeline_tag: audio-classification
library_name: transformers
base_model:
- m-a-p/MERT-v1-95M
---
# CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning
[**Read the full paper (to be presented at ISMIR 2025)**](https://arxiv.org/abs/2506.17818)
**CultureMERT-TA-95M** is a 95M-parameter music foundation model adapted to diverse musical cultures through [**task arithmetic**](https://arxiv.org/abs/2212.04089). Instead of direct continual pre-training on a multi-cultural mixture, as in [CultureMERT-95M](https://huggingface.co/ntua-slp/CultureMERT-95M), this model merges multiple **single-culture adapted** variants of [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M), each continually pre-trained via our two-stage strategy on a distinct musical tradition:
| Dataset | Music Tradition | Hours Used |
|-----------------|-----------------------------|------------|
| [*Lyra*](https://github.com/pxaris/lyra-dataset) | Greek traditional/folk | 50h |
| [*Turkish-makam*](https://dunya.compmusic.upf.edu/makam/) | Turkish/Ottoman classical | 200h |
| [*Hindustani*](https://dunya.compmusic.upf.edu/hindustani/) | North Indian classical | 200h |
| [*Carnatic*](https://dunya.compmusic.upf.edu/carnatic/) | South Indian classical | 200h |
> The final model was merged using a scaling factor of **λ = 0.2**, which yielded the best overall performance across all task arithmetic variants evaluated.

This model serves as an alternative to [**CultureMERT-95M**](https://huggingface.co/ntua-slp/CultureMERT-95M). It merges culturally specialized models in weight space via task arithmetic to form a unified multi-cultural model. Each single-culture adapted model is obtained using the same two-stage continual pre-training strategy as CultureMERT-95M, applied separately to each musical tradition prior to merging.
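
The weight-space merge described above can be sketched as follows. This is a minimal illustration of the general task-arithmetic recipe (base weights plus a scaled sum of per-model offsets), not the authors' merging script. `merge_task_arithmetic` and the toy checkpoints are hypothetical, and all checkpoints are assumed to share the same parameter names. Values can be anything that supports arithmetic, such as tensors taken from `state_dict()`; plain floats are used here so the block runs without dependencies.

```python
def merge_task_arithmetic(base, adapted_models, lam=0.2):
    """Return base + lam * sum_i (adapted_i - base), computed per parameter.

    base: dict mapping parameter name -> value (float, array, or tensor)
    adapted_models: list of dicts with the same keys as `base`
    lam: scaling factor (0.2 is the value reported for the released model)
    """
    merged = {}
    for name, base_value in base.items():
        # Task vector of each adapted model: its offset from the base weights.
        task_vector_sum = sum(model[name] - base_value for model in adapted_models)
        merged[name] = base_value + lam * task_vector_sum
    return merged


# Toy checkpoints (hypothetical values, two "cultures" instead of four).
base = {"w": 1.0, "b": 0.0}
model_a = {"w": 2.0, "b": 0.5}
model_b = {"w": 0.0, "b": 0.5}

merged = merge_task_arithmetic(base, [model_a, model_b], lam=0.2)
print(merged)
```

In this toy case the two offsets on `w` cancel, so `w` stays at its base value, while the offsets on `b` agree and are added back scaled by `lam`.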
---
# Evaluation
We follow the same evaluation protocol as [CultureMERT-95M](https://huggingface.co/ntua-slp/CultureMERT-95M) and compare results against both that model and [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M):
## ROC-AUC / mAP
| | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | **Avg.** |
|--------------------|:-------------:|:----------:|:--------:|:----:|:---:|:----:|:--------:|
| **MERT-v1-95M** | 83.2% / 53.3% | 82.4% / 52.9% | 74.9% / 39.7% | 85.7% / 56.5% | 90.7% / 48.1% | 89.6% / 35.9% | 66.1% |
| **CultureMERT-95M** | **89.6%** / 60.6% | **88.2%** / **63.5%** | **79.2%** / 43.1% | 86.9% / 56.7% | 90.7% / 48.1% | 89.4% / 35.9% | **69.3%** |
| **CultureMERT-TA-95M** | 89.0% / **61.0%** | 87.5% / 59.3% | 79.1% / **43.3%** | **87.3%** / **57.3%** | **90.8%** / **49.1%** | 89.6% / **36.4%** | 69.1% |
## Micro-F1 / Macro-F1
| | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | **Avg.** |
|--------------------|:-------------:|:----------:|:--------:|:----:|:---:|:----:|:--------:|
| **MERT-v1-95M** | 73.0% / 38.9% | 71.1% / 33.2% | 80.1% / 30.0% | 72.4% / 42.6% | 57.0% / 36.9% | 35.7% / 21.2% | 49.3% |
| **CultureMERT-95M** | **77.4%** / **45.8%** | **77.8%** / **50.4%** | **82.7%** / **32.5%** | **73.1%** / 43.1% | 58.3% / 36.6% | 35.6% / **22.9%** | **52.9%** |
| **CultureMERT-TA-95M** | 76.9% / 45.4% | 74.2% / 45.0% | 82.5% / 32.1% | 73.0% / **45.3%** | **59.1%** / **38.2%** | 35.7% / 21.5% | 52.4% |
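
The gap between the two F1 columns comes from how the scores are aggregated: Micro-F1 pools true and false positives over all tags before computing F1, while Macro-F1 computes F1 per tag and then averages, so rare tags weigh more in the macro score. The sketch below illustrates this on made-up binary multi-label predictions. `micro_macro_f1` and the toy data are illustrative assumptions, not the evaluation code behind the table, and a tag with no true or predicted positives is assumed to score 0.

```python
def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 lists (rows = clips, columns = tags)."""
    num_tags = len(y_true[0])
    tp = [0] * num_tags
    fp = [0] * num_tags
    fn = [0] * num_tags
    for true_row, pred_row in zip(y_true, y_pred):
        for tag, (t, p) in enumerate(zip(true_row, pred_row)):
            tp[tag] += int(t == 1 and p == 1)
            fp[tag] += int(t == 0 and p == 1)
            fn[tag] += int(t == 1 and p == 0)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over tags first. Macro: average the per-tag F1 scores.
    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(*counts) for counts in zip(tp, fp, fn)) / num_tags
    return micro, macro


# Tag 0 is frequent and always predicted correctly; tag 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

A single missed rare tag barely moves the micro score but halves the macro score here, which is the same pattern as the large micro/macro gaps in the table.
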
**CultureMERT-TA-95M** performs comparably to **CultureMERT-95M** on non-Western datasets, while surpassing it on *Lyra* and Western benchmarks. It also outperforms **MERT-v1-95M** on Western tasks (MTAT and FMA-medium) by an average margin of **+0.7 percentage points** across all metrics.
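
The average margin over MERT-v1-95M on the Western benchmarks can be checked directly from the two tables. The snippet below assumes the margin is the unweighted mean of the eight per-metric differences (four metrics on each of FMA-medium and MTAT), which matches the reported figure after rounding. The variable names are illustrative.

```python
# Scores copied from the tables above, ordered as
# (ROC-AUC, mAP, Micro-F1, Macro-F1) for each Western benchmark.
mert = {"FMA-medium": (90.7, 48.1, 57.0, 36.9), "MTAT": (89.6, 35.9, 35.7, 21.2)}
ta = {"FMA-medium": (90.8, 49.1, 59.1, 38.2), "MTAT": (89.6, 36.4, 35.7, 21.5)}

# Per-metric differences: CultureMERT-TA-95M minus MERT-v1-95M.
diffs = [t - m for name in mert for t, m in zip(ta[name], mert[name])]
avg_margin = sum(diffs) / len(diffs)
print(f"{avg_margin:+.1f}")  # prints +0.7
```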
---
# Model Usage
```python
from transformers import Wav2Vec2FeatureExtractor, AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset
# Load model weights and preprocessor config
model = AutoModel.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
# Load example audio
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation", trust_remote_code=True).sort("id")
audio_array = dataset[0]["audio"]["array"]
sampling_rate = dataset.features["audio"].sampling_rate
# Resample if needed
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f'Setting sample rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None
# Audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).to(dtype=resampler.kernel.dtype))
# Extract hidden states
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Representations: 13 layers (CNN feature extractor + 12 Transformer)
# NOTE: each layer performs differently in different downstream tasks - you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [13 layers, Time steps, 768 feature_dim]
# For utterance-level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [13, 768]
# You can even use a learnable weighted average representation over all layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape) # [768]
```
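
As an alternative to the `Conv1d` aggregator above, a common choice in probing setups is a softmax-normalized weighted sum over layers followed by a linear classifier. The following is a hedged sketch, not part of the released code: `LayerWeightedProbe` and `num_tags` are hypothetical names, and a random tensor stands in for a batch of time-averaged embeddings so the block runs without downloading the model.

```python
import torch
from torch import nn


class LayerWeightedProbe(nn.Module):
    """Softmax-weighted sum over layers, then a linear head (illustrative)."""

    def __init__(self, num_layers=13, feature_dim=768, num_tags=10):
        super().__init__()
        # One learnable scalar per layer; softmax keeps weights positive and summing to 1.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(feature_dim, num_tags)

    def forward(self, layer_embeddings):
        # layer_embeddings: [batch, num_layers, feature_dim]
        weights = torch.softmax(self.layer_logits, dim=0)
        pooled = torch.einsum("l,bld->bd", weights, layer_embeddings)
        return self.head(pooled)  # raw logits, one per tag


# Random stand-in for four clips, each with time-averaged embeddings of shape [13, 768].
dummy = torch.randn(4, 13, 768)
probe = LayerWeightedProbe()
logits = probe(dummy)
print(logits.shape)  # torch.Size([4, 10])
```

With zero-initialized layer logits the probe starts from a uniform average over the 13 layers and can learn to favor whichever layers suit the downstream task.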
---
# Ethical Considerations
This model is released under a non-commercial CC BY-NC 4.0 license and is intended for research purposes. While it is designed to address cultural bias in MIR, its training data and pre-training paradigm may still reflect cultural and dataset-specific biases. The model should not be used in commercial or generative applications without explicit consideration of cultural representation, proper attribution, and consent from relevant communities or dataset curators.
# Citation
```bibtex
@misc{kanatas2025culturemertcontinualpretrainingcrosscultural,
      title={CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning},
      author={Angelos-Nikolaos Kanatas and Charilaos Papaioannou and Alexandros Potamianos},
      year={2025},
      eprint={2506.17818},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.17818},
}
```
---