---
license: cc-by-nc-4.0
tags:
- audio
- music
- merge
metrics:
- roc_auc
- average_precision
- f1
model_type: audio-classification
pipeline_tag: audio-classification
library_name: transformers
base_model:
- m-a-p/MERT-v1-95M
---
|
|
|
# CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning |
|
[**Read the full paper (to be presented at ISMIR 2025)**](https://arxiv.org/abs/2506.17818)
|
|
|
**CultureMERT-TA-95M** is a 95M-parameter music foundation model adapted to diverse musical cultures through [**task arithmetic**](https://arxiv.org/abs/2212.04089). Instead of direct continual pre-training on a multi-cultural mixture, as in [CultureMERT-95M](https://huggingface.co/ntua-slp/CultureMERT-95M), this model merges multiple **single-culture adapted** variants of [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M), each continually pre-trained via our two-stage strategy on a distinct musical tradition:
|
|
|
|
|
| Dataset | Music Tradition | Hours Used |
|-----------------|-----------------------------|------------|
| [*Lyra*](https://github.com/pxaris/lyra-dataset) | Greek traditional/folk | 50h |
| [*Turkish-makam*](https://dunya.compmusic.upf.edu/makam/) | Turkish/Ottoman classical | 200h |
| [*Hindustani*](https://dunya.compmusic.upf.edu/hindustani/) | North Indian classical | 200h |
| [*Carnatic*](https://dunya.compmusic.upf.edu/carnatic/) | South Indian classical | 200h |
|
|
|
|
|
> The final model was merged using a scaling factor of **λ = 0.2**, which yielded the best overall performance across all task arithmetic variants evaluated.
|
|
|
|
|
This model serves as an alternative to [**CultureMERT-95M**](https://huggingface.co/ntua-slp/CultureMERT-95M). It merges culturally specialized models in weight space via task arithmetic to form a unified multi-cultural model. Each single-culture adapted model is obtained using the same two-stage continual pre-training strategy as CultureMERT-95M, applied separately to each musical tradition prior to merging.
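
For readers who want to see the merging step mechanically, below is a minimal sketch of weight-space merging via task arithmetic under the rule θ_merged = θ_base + λ · Σ_t (θ_t − θ_base) with λ = 0.2. The checkpoint paths are hypothetical placeholders (the single-culture adapted models are not published under these names), and this is an illustration of the technique, not the exact script used to produce this release.

```python
import torch
from transformers import AutoModel

# Hypothetical local paths to the four single-culture adapted checkpoints
# (placeholders for illustration only).
BASE_ID = "m-a-p/MERT-v1-95M"
CULTURE_CHECKPOINTS = [
    "path/to/mert-adapted-lyra",           # Greek traditional/folk
    "path/to/mert-adapted-turkish-makam",  # Turkish/Ottoman classical
    "path/to/mert-adapted-hindustani",     # North Indian classical
    "path/to/mert-adapted-carnatic",       # South Indian classical
]
LAMBDA = 0.2  # scaling factor reported above

base = AutoModel.from_pretrained(BASE_ID, trust_remote_code=True)
base_state = base.state_dict()

# Accumulate the task vectors (theta_t - theta_base) over all adapted models;
# all checkpoints are assumed to share the base architecture and parameter names.
delta_sum = {k: torch.zeros_like(v) for k, v in base_state.items() if v.is_floating_point()}
for ckpt in CULTURE_CHECKPOINTS:
    adapted_state = AutoModel.from_pretrained(ckpt, trust_remote_code=True).state_dict()
    for k in delta_sum:
        delta_sum[k] += adapted_state[k] - base_state[k]

# theta_merged = theta_base + lambda * sum_t (theta_t - theta_base)
merged_state = dict(base_state)
for k, delta in delta_sum.items():
    merged_state[k] = base_state[k] + LAMBDA * delta

base.load_state_dict(merged_state)
base.save_pretrained("merged-culture-mert")  # local output directory
```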
|
|
|
--- |
|
|
|
# Evaluation
|
|
|
We follow the same evaluation protocol as [CultureMERT-95M](https://huggingface.co/ntua-slp/CultureMERT-95M) and report results in comparison to both it and [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M): |
|
|
|
|
|
## ROC-AUC / mAP |
|
|
|
| Model | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | **Avg.** |
|--------------------|:-------------:|:----------:|:--------:|:----:|:----------:|:----:|:--------:|
| **MERT-v1-95M** | 83.2% / 53.3% | 82.4% / 52.9% | 74.9% / 39.7% | 85.7% / 56.5% | 90.7% / 48.1% | 89.6% / 35.9% | 66.1% |
| **CultureMERT-95M** | **89.6%** / 60.6% | **88.2%** / **63.5%** | **79.2%** / 43.1% | 86.9% / 56.7% | 90.7% / 48.1% | 89.4% / 35.9% | **69.3%** |
| **CultureMERT-TA-95M** | 89.0% / **61.0%** | 87.5% / 59.3% | 79.1% / **43.3%** | **87.3%** / **57.3%** | **90.8%** / **49.1%** | 89.6% / **36.4%** | 69.1% |
|
|
|
|
|
## Micro-F1 / Macro-F1 |
|
|
|
| Model | Turkish-makam | Hindustani | Carnatic | Lyra | FMA-medium | MTAT | **Avg.** |
|--------------------|:-------------:|:----------:|:--------:|:----:|:----------:|:----:|:--------:|
| **MERT-v1-95M** | 73.0% / 38.9% | 71.1% / 33.2% | 80.1% / 30.0% | 72.4% / 42.6% | 57.0% / 36.9% | 35.7% / 21.2% | 49.3% |
| **CultureMERT-95M** | **77.4%** / **45.8%** | **77.8%** / **50.4%** | **82.7%** / **32.5%** | **73.1%** / 43.1% | 58.3% / 36.6% | 35.6% / **22.9%** | **52.9%** |
| **CultureMERT-TA-95M** | 76.9% / 45.4% | 74.2% / 45.0% | 82.5% / 32.1% | 73.0% / **45.3%** | **59.1%** / **38.2%** | 35.7% / 21.5% | 52.4% |
|
|
|
|
|
**CultureMERT-TA-95M** performs comparably to **CultureMERT-95M** on non-Western datasets, while surpassing it on *Lyra* and Western benchmarks. It also outperforms **MERT-v1-95M** on Western tasks (MTAT and FMA-medium) by an average margin of **+0.7%** across all metrics.
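
For reference, the metrics reported above (ROC-AUC, mAP, micro-/macro-F1) can be computed with scikit-learn once frozen embeddings and binary tag matrices have been extracted for a downstream dataset. The sketch below is illustrative only: the file names and the logistic-regression probe are placeholders, not the probing setup used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

# Hypothetical arrays: X_* are frozen CultureMERT embeddings (n_samples, 768),
# Y_* are binary multi-label tag matrices (n_samples, n_tags).
X_train, Y_train = np.load("embeddings_train.npy"), np.load("labels_train.npy")
X_test, Y_test = np.load("embeddings_test.npy"), np.load("labels_test.npy")

# Shallow probe trained on top of the frozen features (one classifier per tag)
probe = MultiOutputClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, Y_train)

# Per-tag positive-class scores for ROC-AUC / mAP, thresholded at 0.5 for F1
scores = np.stack([est.predict_proba(X_test)[:, 1] for est in probe.estimators_], axis=1)
preds = (scores >= 0.5).astype(int)

print("ROC-AUC :", roc_auc_score(Y_test, scores, average="macro"))
print("mAP     :", average_precision_score(Y_test, scores, average="macro"))
print("Micro-F1:", f1_score(Y_test, preds, average="micro"))
print("Macro-F1:", f1_score(Y_test, preds, average="macro"))
```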
|
|
|
--- |
|
|
|
# Model Usage
|
|
|
```python
from transformers import Wav2Vec2FeatureExtractor, AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load model weights and preprocessor config
model = AutoModel.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)

# Load example audio
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation", trust_remote_code=True).sort("id")
audio_array = dataset[0]["audio"]["array"]
sampling_rate = dataset.features["audio"].sampling_rate

# Resample if needed
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f'Setting sample rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# Audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).to(dtype=resampler.kernel.dtype))

# Extract hidden states
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Representations: 13 layers (CNN feature extractor output + 12 Transformer layers)
# NOTE: each layer performs differently on different downstream tasks; choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layers, time steps, 768 feature_dim]

# For utterance-level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# You can even use a learnable weighted average representation over all layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
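
To use the aggregated embedding for a downstream tagging task, one simple option (not included in this repository) is to train a small head on top of it. A minimal sketch, reusing `weighted_avg_hidden_states` from the snippet above and a placeholder `num_tags`:

```python
# Placeholder number of tags for a hypothetical downstream tagging dataset
num_tags = 10

# Small trainable head on top of the 768-dim aggregated representation
classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, num_tags),
)

# Multi-label tag probabilities for the single example above
logits = classifier(weighted_avg_hidden_states.unsqueeze(0))  # [1, num_tags]
probs = torch.sigmoid(logits)
print(probs.shape)  # [1, num_tags]
```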
|
|
|
--- |
|
|
|
# Ethical Considerations |
|
|
|
This model is released under a non-commercial CC BY-NC 4.0 license and is intended for research purposes. While it is designed to address cultural bias in MIR, its training data and pre-training paradigm may still reflect cultural and dataset-specific biases. The model should not be used in commercial or generative applications without explicit consideration of cultural representation, proper attribution, and consent from relevant communities or dataset curators.
|
|
|
|
|
# Citation
|
|
|
```bibtex
@misc{kanatas2025culturemertcontinualpretrainingcrosscultural,
  title={CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning},
  author={Angelos-Nikolaos Kanatas and Charilaos Papaioannou and Alexandros Potamianos},
  year={2025},
  eprint={2506.17818},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2506.17818},
}
```
|
|
|
--- |