akanatas committed on
Commit 24241ee (verified) · Parent(s): ae24c14

Update README.md

Files changed (1):
  1. README.md +71 -1
README.md CHANGED
@@ -12,4 +12,74 @@ pipeline_tag: audio-classification
  library_name: transformers
  ---

- ...TODO
+ # CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning
+ 📑 [**Read the full paper (to be presented at ISMIR 2025)**](...TODO)
+
+ ---
+
+ # 🔧 Model Usage
+
+ ```python
+ from transformers import Wav2Vec2FeatureExtractor, AutoModel
+ import torch
+ from torch import nn
+ import torchaudio.transforms as T
+ from datasets import load_dataset
+
+ # Load model weights and preprocessor config
+ model = AutoModel.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
+ processor = Wav2Vec2FeatureExtractor.from_pretrained("ntua-slp/CultureMERT-TA-95M", trust_remote_code=True)
+
+ # Load example audio
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation").sort("id")
+ audio_array = dataset[0]["audio"]["array"]
+ sampling_rate = dataset.features["audio"].sampling_rate
+
+ # Resample to the sampling rate expected by the model, if needed
+ resample_rate = processor.sampling_rate
+ if resample_rate != sampling_rate:
+     print(f"Resampling from {sampling_rate} Hz to {resample_rate} Hz")
+     resampler = T.Resample(sampling_rate, resample_rate)
+     input_audio = resampler(torch.from_numpy(audio_array).float())
+ else:
+     input_audio = audio_array
+
+ # Extract hidden states
+ inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs, output_hidden_states=True)
+
+ # Representations: 13 layers (CNN feature extractor output + 12 Transformer layers)
+ # NOTE: layers perform differently on different downstream tasks - choose the layer empirically
+ all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
+ print(all_layer_hidden_states.shape)  # [13 layers, time steps, 768 feature_dim]
+
+ # For utterance-level classification tasks, you can simply average the representation over time
+ time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
+ print(time_reduced_hidden_states.shape)  # [13, 768]
+
+ # You can also use a learnable weighted average over all layers
+ aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
+ weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
+ print(weighted_avg_hidden_states.shape)  # [768]
+ ```
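+
+ As a rough follow-up (a minimal sketch, not part of the official pipeline): the 768-dim embedding above can feed a small downstream classifier. In practice you would also swap the LibriSpeech demo clip for your own music audio (e.g., loaded with `torchaudio.load`). The `num_classes` value, placeholder label, and single training step below are hypothetical illustrations, not from the paper.
+
+ ```python
+ # Hypothetical linear probe on the frozen embedding from the block above;
+ # num_classes, the label, and the single update step are placeholders.
+ num_classes = 10
+ probe = nn.Linear(768, num_classes)
+ optimizer = torch.optim.AdamW(
+     list(probe.parameters()) + list(aggregator.parameters()), lr=1e-3
+ )
+ criterion = nn.CrossEntropyLoss()
+
+ label = torch.tensor([0])                                # placeholder class index
+ logits = probe(weighted_avg_hidden_states.unsqueeze(0))  # [1, num_classes]
+ loss = criterion(logits, label)
+ loss.backward()   # gradients reach only the probe and the layer aggregator
+ optimizer.step()
+ ```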
+
+ ---
+
+ # Ethical Considerations
+
+ This model is released under the non-commercial CC BY-NC 4.0 license and is intended for research purposes. While it is designed to address cultural bias in music information retrieval (MIR), its training data and pretraining paradigm may still reflect cultural and dataset-specific biases. The model should not be used in commercial or generative applications without explicit consideration of cultural representation, proper attribution, and consent from relevant communities and dataset curators.
+
+ # 📚 Citation
+
+ ```bibtex
+ ...TODO
+ ```
+
+ ---