---
license: cc-by-nc-sa-4.0
library_name: omar_rq
pipeline_tag: audio-classification
tags:
  - audio
  - music
  - self-supervised-learning
  - masked-language-modeling
---

# OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction

This repository contains the model weights and code for **OMAR-RQ**, an Open Music Audio Representation Model trained with Multi-Feature Masked Token Prediction. It was introduced in the paper [OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction](https://huggingface.co/papers/2507.03482).

OMAR-RQ is developed to advance research in music audio understanding and provide powerful, multipurpose representations for music information retrieval.

[📄 Paper](https://huggingface.co/papers/2507.03482) | [💻 Code](https://github.com/MTG/OMAR-RQ)

## Abstract
Developing open-source foundation models is essential for advancing research in music audio understanding and ensuring access to powerful, multipurpose representations for music information retrieval. We present OMAR-RQ, a model trained with self-supervision via masked token classification methodologies using a large-scale dataset with over 330,000 hours of music audio. We experiment with different input features and quantization options, and achieve state-of-the-art performance in music tagging, pitch estimation, chord recognition, beat tracking, segmentation, and difficulty estimation among open self-supervised models. We open-source our training and evaluation pipelines and model weights, available at https://github.com/MTG/OMAR-RQ.

## Installation

For embedding extraction or fine-tuning:

```bash
pip install .
```

For development including pre-training your own models:

```bash
pip install -e .[train]
```
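
Both commands assume the repository has been cloned locally; a typical setup might look like this (environment details may differ in your setup):

```bash
git clone https://github.com/MTG/OMAR-RQ.git
cd OMAR-RQ
pip install .
```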

## Inference

Load a model by specifying its Hugging Face model ID:

```python
import torch
from omar_rq import get_model

# Embedding extraction example
x = torch.randn(1, 16000 * 4).cpu()

model_id = "mtg-upf/omar-rq-multifeature-25hz-fsq"
model = get_model(model_id=model_id, device="cpu")

embeddings = model.extract_embeddings(x, layers=[6])

timestamps = torch.arange(embeddings.shape[2]) / model.eps
```
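
The output follows the `(L, B, T, C)` layout documented in the reference below, where `L` is the number of requested layers. With a single layer, as in the example above, one might (illustratively) drop the layer axis before downstream use:

```python
layer_6 = embeddings[0]  # (B, T, C): batch, time frames, embedding channels
print(layer_6.shape, timestamps[:3])
```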

`get_model` reference:

```
Returns an OMAR-RQ Module from the provided model_id or config_file.

Args:
    model_id (str): Hugging Face's Model ID or local path to the model
    config_file (Path): Path to the model config of a trained model.
    device (str): Device to use for the model. Defaults to "cpu".
    quantization_targets (bool): If True, it will create the quantization
        targets for SSL pre-training of the model. Defaults to False.

Output:
    module: The model from the provided config file.


Module usage:

Args:
    audio (torch.Tensor): 2D mono audio tensor (B, T'). Where B is
        the batch size and T' is the number of samples.
    layers (set): Set of layer indices to extract embeddings from.
        By default, it extracts embeddings from the last layer (logits).

Output:
    torch.Tensor: Extracted embeddings. The output tensor has shape
        (L, B, T, C,) where L = len(layers), B is the batch size, T is
        the number of output timestamps, and C = embedding dimension.


Example:

>>> x = torch.randn(1, 16000 * 4).cpu()
>>>
>>> model = get_model(config_file, device="cpu")
>>>
>>> embeddings = model.extract_embeddings(x, layers=[6])
>>>
>>> # use the `eps` field to compute timestamps
>>> timestamps = torch.arange(embeddings.shape[2]) / model.eps



NOTE: The model's embedding rate depends on the model's configuration.
    For example, the melspectrogram model has an embedding rate of 16 ms.
    audio should be a sequence with a sample rate as indicated in the
    config file and up to 30 s long.
```
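
Because inputs are limited to roughly 30 s, longer recordings need to be split before extraction. The helper below is a minimal illustrative sketch, not part of the library; it assumes 16 kHz mono input and that a shorter final chunk is accepted:

```python
import torch

def extract_long(model, audio, sr=16000, chunk_s=30, layers=(6,)):
    """Illustrative helper: extract embeddings from audio longer than the
    ~30 s input limit by processing fixed-length chunks and concatenating
    the results along the time axis of the (L, B, T, C) output."""
    chunk = sr * chunk_s
    parts = []
    for start in range(0, audio.shape[-1], chunk):
        segment = audio[..., start:start + chunk]
        parts.append(model.extract_embeddings(segment, layers=list(layers)))
    return torch.cat(parts, dim=2)
```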

`extract_embeddings` reference:

```
Extract embeddings from an input audio batch.

Args:
    audio (torch.Tensor): 2D mono audio tensor (B, T'). Where B is
        the batch size and T' is the number of samples.
    layers (set): Set of layer indices to extract embeddings from.
        By default, it extracts embeddings from the last layer (logits).

Output:
    torch.Tensor: Extracted embeddings. The output tensor has shape
        (L, B, T, C,) where L = len(layers), B is the batch size, T is
        the number of output timestamps, and C = embedding dimension.
```
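
For instance, one illustrative pattern is to request several layers and mean-pool over time to obtain clip-level vectors for a downstream probe (the layer indices here are arbitrary examples, using `model` and `x` from the inference example above):

```python
embeddings = model.extract_embeddings(x, layers=[4, 6])  # (2, B, T, C)
clip_vectors = embeddings.mean(dim=2)                     # (2, B, C)
```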

## Available models

| Model | Input | Rate (Hz) | Tagging (_mAP_) | Difficulty (_MSE_) | Pitch (_acc._) | Chord (_acc._) | Beat (_F1_) | Structure (_acc._) |
|---|---|---|---|---|---|---|---|---|
| **base** | mel | 15.63 | .482 | **1.65** | .892 | .657 | .783 | **.647** |
| **multicodebook** | mel | 15.63 | **.488** | 1.66 | .897 | .675 | .775 | .639 |
| **multifeature** | audio | 18.75 | .467 | 1.76 | .938 | .734 | .833 | .623 |
| **multifeature-25hz** | audio | 25 | .463 | 1.79 | .932 | .728 | .848 | .628 |
| **multifeature-25hz-fsq** | audio | 25 | .463 | 1.71 | **.940** | **.749** | **.855** | .628 |

OMAR-RQ models are offered in different configurations, each with its own strengths and weaknesses. Models based on mel spectrogram (**base** and **multicodebook**) tend to perform better on semantic tasks such as auto-tagging, structure recognition, and difficulty estimation. On the other hand, **multifeature-25hz-fsq** offers the best performance in tonal and temporal tasks such as pitch and chord estimation, and beat tracking.

### Hugging Face Model IDs

- [mtg-upf/omar-rq-base](https://huggingface.co/mtg-upf/omar-rq-base)
- [mtg-upf/omar-rq-multicodebook](https://huggingface.co/mtg-upf/omar-rq-multicodebook)
- [mtg-upf/omar-rq-multifeature](https://huggingface.co/mtg-upf/omar-rq-multifeature)
- [mtg-upf/omar-rq-multifeature-25hz](https://huggingface.co/mtg-upf/omar-rq-multifeature-25hz)
- [mtg-upf/omar-rq-multifeature-25hz-fsq](https://huggingface.co/mtg-upf/omar-rq-multifeature-25hz-fsq)
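
Any of these IDs can be passed to `get_model`. For example, to load the mel-based base model instead of the multifeature variant used above:

```python
from omar_rq import get_model

model = get_model(model_id="mtg-upf/omar-rq-base", device="cpu")
```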

## Pre-training OMAR-RQ models

1.  Install development dependencies:

    ```bash
    pip install -e .[train]
    ```

2.  Prepare the experiment data

    We downsample our data to 16 kHz mono and store it as 16-bit raw bytes ([numpy memmap](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html) files). Check our [data preparation scripts](data/).

3.  Configuration

    Our experiment configuration is controlled with [gin-config](https://github.com/google/gin-config). Check the default [config file](cfg/rq_single_view/config.gin) to see the different parameters that can be modified.

    At least the following parameters should be modified (an example excerpt is shown after this list):

    -   `DiscotubeMultiViewAudioDataModule.data_dir` -> Your base data folder.
    -   `DiscotubeMultiViewAudioDataModule.filelist_train` -> Filelist of training audio paths relative to the `data_dir` (one file per line).
    -   `DiscotubeMultiViewAudioDataModule.filelist_val` -> Same for the tracks on the validation split.

4.  Run the experiment

    ```bash
    python src/train.py cfg/rq_single_view/config.gin
    ```
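
The parameters listed in step 3 are plain gin bindings, so an override might look like the following excerpt (all paths below are placeholders for your own data):

```
# cfg/rq_single_view/config.gin (excerpt with placeholder values)
DiscotubeMultiViewAudioDataModule.data_dir = "/path/to/audio/data"
DiscotubeMultiViewAudioDataModule.filelist_train = "filelists/train.txt"
DiscotubeMultiViewAudioDataModule.filelist_val = "filelists/val.txt"
```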

## Licensing information

The code in this repository is available under the [AGPL-3.0 license](https://www.gnu.org/licenses/agpl-3.0.en.html).
The model weights are available under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license for non-commercial applications.
[Contact us](https://www.upf.edu/web/mtg/contact) for more information.

## Citation

If you find our work helpful or inspiring, please feel free to cite it:

```bibtex
@article{omar-rq-2025,
  author={Font-Clos, Francesc and Serra, Xavier},
  title={OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction},
  journal={arXiv preprint arXiv:2507.03482},
  year={2025},
}
```