---
pipeline_tag: audio-classification
library_name: transformers
---
# CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning
📑 [**Read the full paper (to be presented at ISMIR 2025)**](...TODO)

---

# 🔧 Model Usage

```python
from transformers import Wav2Vec2FeatureExtractor, AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load model weights and preprocessor config
model = AutoModel.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)

# Load example audio
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation").sort("id")
audio_array = dataset[0]["audio"]["array"]
sampling_rate = dataset.features["audio"].sampling_rate

# Resample to the rate expected by the feature extractor, if needed
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f"Resampling from {sampling_rate} Hz to {resample_rate} Hz")
    resampler = T.Resample(sampling_rate, resample_rate)
    input_audio = resampler(torch.from_numpy(audio_array))
else:
    input_audio = audio_array

# Extract hidden states
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Representations: 13 layers (CNN feature extractor output + 12 Transformer layers)
# NOTE: each layer performs differently on different downstream tasks - choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layers, time steps, 768 feature_dim]

# For utterance-level classification tasks, you can simply average the representation over time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# You can also use a learnable weighted average over all layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
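
To train a lightweight classifier on top of these features, a common pattern is to combine the learnable layer aggregator above with a linear head. The sketch below is illustrative only and continues from the snippet above: the layer count (13) and feature dimension (768) come from the model, while `num_classes` and the `LayerWeightedProbe` module itself are hypothetical choices for a downstream task, not part of the released model.

```python
from torch import nn

class LayerWeightedProbe(nn.Module):
    """Learnable weighted average over the 13 layers, followed by a linear classifier (illustrative sketch)."""
    def __init__(self, num_layers=13, feature_dim=768, num_classes=10):
        super().__init__()
        self.aggregator = nn.Conv1d(in_channels=num_layers, out_channels=1, kernel_size=1)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, layer_features):
        # layer_features: [batch, num_layers, feature_dim]
        pooled = self.aggregator(layer_features).squeeze(1)  # [batch, feature_dim]
        return self.classifier(pooled)                       # [batch, num_classes]

# Probe the time-reduced representations from the snippet above (add a batch dimension)
probe = LayerWeightedProbe(num_classes=10)  # num_classes is task-specific, chosen here for illustration
logits = probe(time_reduced_hidden_states.unsqueeze(0))
print(logits.shape)  # [1, 10]
```

Such a probe would typically be trained with a standard cross-entropy loss while the CultureMERT backbone stays frozen.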

---

# Ethical Considerations

This model is released under a non-commercial CC BY-NC 4.0 license and is intended for research purposes. While it is designed to address cultural bias in MIR, its training data and pre-training paradigm may still reflect cultural and dataset-specific biases. The model should not be used in commercial or generative applications without explicit consideration of cultural representation, proper attribution, and consent from relevant communities and dataset curators.

# 📚 Citation

```bibtex
...TODO
```

---