---
pipeline_tag: audio-classification
library_name: transformers
---
# CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning
📑 [**Read the full paper (to be presented at ISMIR 2025)**](...TODO)

---

# 🔧 Model Usage

```python
from transformers import Wav2Vec2FeatureExtractor, AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load model weights and preprocessor config
model = AutoModel.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)

# Load example audio
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation").sort("id")
audio_array = dataset[0]["audio"]["array"]
sampling_rate = dataset.features["audio"].sampling_rate

# Resample to the rate expected by the feature extractor, if needed
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f"Resampling from {sampling_rate} Hz to {resample_rate} Hz")
    resampler = T.Resample(sampling_rate, resample_rate)
    input_audio = resampler(torch.from_numpy(audio_array))
else:
    input_audio = audio_array

# Extract hidden states
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Representations: 13 layers (CNN feature extractor output + 12 Transformer layers)
# NOTE: each layer performs differently on different downstream tasks - choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layers, time steps, 768 feature_dim]

# For utterance-level classification tasks, you can simply average the representation over time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# You can also use a learnable weighted average over all layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
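
To train a lightweight classifier on top of these features, a common pattern is to combine the learnable layer aggregator above with a linear head. The sketch below is illustrative only and continues from the snippet above: the layer count (13) and feature dimension (768) come from the model, while `num_classes` and the `LayerWeightedProbe` module itself are hypothetical choices for a downstream task, not part of the released model.

```python
from torch import nn

class LayerWeightedProbe(nn.Module):
    """Learnable weighted average over the 13 layers, followed by a linear classifier (illustrative sketch)."""
    def __init__(self, num_layers=13, feature_dim=768, num_classes=10):
        super().__init__()
        self.aggregator = nn.Conv1d(in_channels=num_layers, out_channels=1, kernel_size=1)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, layer_features):
        # layer_features: [batch, num_layers, feature_dim]
        pooled = self.aggregator(layer_features).squeeze(1)  # [batch, feature_dim]
        return self.classifier(pooled)                       # [batch, num_classes]

# Probe the time-reduced representations from the snippet above (add a batch dimension)
probe = LayerWeightedProbe(num_classes=10)  # num_classes is task-specific, chosen here for illustration
logits = probe(time_reduced_hidden_states.unsqueeze(0))
print(logits.shape)  # [1, 10]
```

Such a probe would typically be trained with a standard cross-entropy loss while the CultureMERT backbone stays frozen.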

---

# Ethical Considerations

This model is released under a non-commercial CC BY-NC 4.0 license and is intended for research purposes. While it is designed to address cultural bias in MIR, its training data and pre-training paradigm may still reflect cultural and dataset-specific biases. The model should not be used in commercial or generative applications without explicit consideration of cultural representation, proper attribution, and consent from relevant communities and dataset curators.

# 📚 Citation

```bibtex
...TODO
```

---