---
pipeline_tag: audio-classification
library_name: omar_rq
license: cc-by-nc-sa-4.0
tags:
  - audio
  - music
  - self-supervised-learning
  - audio-representation
  - music-tagging
  - pitch-estimation
  - chord-recognition
  - beat-tracking
  - segmentation
  - difficulty-estimation
---

# OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction

**OMAR-RQ** is an open-source foundation model for music audio understanding, presented in the paper [OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction](https://huggingface.co/papers/2507.03482).

OMAR-RQ is trained with self-supervised masked token classification on a large-scale dataset of over 330,000 hours of music audio. It provides powerful, multipurpose representations for music information retrieval research and achieves state-of-the-art performance among open self-supervised models across a range of tasks:
*   **Music Tagging**
*   **Pitch Estimation**
*   **Chord Recognition**
*   **Beat Tracking**
*   **Segmentation**
*   **Difficulty Estimation**

For the full training, validation, and inference code, please refer to the [official GitHub repository](https://github.com/MTG/omar-rq).

## Installation

For embedding extraction or fine-tuning (run from a local clone of the repository):

```bash
pip install .
```

For development, including pre-training your own models:

```bash
pip install -e .[train]
```

## Inference

You can load an OMAR-RQ model by specifying its Hugging Face model ID:

```python
import torch
from omar_rq import get_model

# Embedding extraction example
x = torch.randn(1, 16000 * 4).cpu()  # example: 4 seconds of mono audio at 16 kHz

# Load a specific model, e.g., "mtg-upf/omar-rq-multifeature-25hz-fsq"
model_id = "mtg-upf/omar-rq-multifeature-25hz-fsq"
model = get_model(model_id=model_id, device="cpu")  # use "cuda" if a GPU is available

# Extract embeddings from layer 6
embeddings = model.extract_embeddings(x, layers=[6])

# `model.eps` is the output embedding rate (embeddings per second),
# so frame indices divided by it give timestamps in seconds
timestamps = torch.arange(embeddings.shape[2]) / model.eps

print(f"Extracted embeddings shape: {embeddings.shape}")
print(f"First 5 timestamps: {timestamps[:5]}")
```
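
Building on the example above, the `(L, B, T, C)` output shape (documented in the references below) makes it easy to combine several layers and pool over time for downstream use. A minimal sketch, assuming the documented output shape; the layer choice and mean pooling are purely illustrative, not the authors' recommended recipe:

```python
import torch
from omar_rq import get_model

model = get_model(model_id="mtg-upf/omar-rq-multicodebook", device="cpu")
x = torch.randn(4, 16000 * 30)  # batch of 4 thirty-second mono clips at 16 kHz

# Request several layers at once; outputs are stacked along the
# first axis: (L, B, T, C) with L = number of requested layers
embeddings = model.extract_embeddings(x, layers=[3, 6, 9])

# Average the requested layers, then mean-pool over time to get one
# clip-level vector per example (e.g., for a linear tagging probe)
clip_vectors = embeddings.mean(dim=0).mean(dim=1)  # shape: (B, C)
print(clip_vectors.shape)
```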

**`get_model` reference:**

```
Returns an OMAR-RQ Module from the provided model_id or config_file.

Args:
    model_id (str): Hugging Face's Model ID or local path to the model
    config_file (Path): Path to the model config of a trained model.
    device (str): Device to use for the model. Defaults to "cpu".
    quantization_targets (bool): If True, it will create the quantization
        targets for SSL pre-training of the model. Defaults to False.

Output:
    module: The model from the provided config file.
```
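
Besides a Hugging Face model ID, `get_model` also accepts a local config file, e.g. for a model you pre-trained yourself. A hedged sketch based on the signature above; the config path is hypothetical:

```python
from pathlib import Path
from omar_rq import get_model

# Hypothetical path to the config of a locally trained model
config_file = Path("runs/my-pretraining/config.yaml")

# quantization_targets=True additionally builds the quantization
# targets used for SSL pre-training (defaults to False)
module = get_model(config_file=config_file, device="cpu",
                   quantization_targets=True)
```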

**`extract_embeddings` reference:**

```
Extract embeddings from an input audio batch.

Args:
    audio (torch.Tensor): 2D mono audio tensor (B, T'), where B is
        the batch size and T' is the number of samples.
    layers (set): Set of layer indices to extract embeddings from.
        By default, it extracts embeddings from the last layer (logits).

Output:
    torch.Tensor: Extracted embeddings. The output tensor has shape
        (L, B, T, C), where L = len(layers), B is the batch size, T is
        the number of output timestamps, and C is the embedding dimension.
```
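
For instance, requesting two layers yields a leading axis of length two. A small sketch reusing `model` and `x` from the inference example above, assuming the leading axis follows the order of the requested layers:

```python
# Shape: (2, B, T, C), since two layers were requested
embeddings = model.extract_embeddings(x, layers=[3, 6])

layer_3 = embeddings[0]  # (B, T, C): activations of the first requested layer
layer_6 = embeddings[1]  # (B, T, C): activations of the second requested layer
```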

## Available Models

OMAR-RQ models are offered in several configurations, each with its own strengths and weaknesses. Models with mel spectrogram input (**base** and **multicodebook**) tend to perform better on semantic tasks such as auto-tagging, structure recognition, and difficulty estimation, while **multifeature-25hz-fsq** offers the best performance on tonal and temporal tasks such as pitch estimation, chord recognition, and beat tracking (see the selection sketch after the table).

| Model                     | Input  | Rate (Hz) | Tagging (_mAP_) | Difficulty (_MSE_) | Pitch (_acc._) | Chord (_acc._) | Beat (_F1_) | Structure (_acc._) | Hugging Face Model ID                                       |
|:--------------------------|:-------|:----------|:----------------|:-------------------|:---------------|:---------------|:------------|:-------------------|:------------------------------------------------------------|
| **base**                  | mel    | 15.63  | .482    | **1.65**   | .892     | .657     | .783    | **.647**  | [`mtg-upf/omar-rq-base`](https://huggingface.co/mtg-upf/omar-rq-base)                   |
| **multicodebook**         | mel    | 15.63  | **.488**| 1.66       | .897     | .675     | .775    | .639      | [`mtg-upf/omar-rq-multicodebook`](https://huggingface.co/mtg-upf/omar-rq-multicodebook) |
| **multifeature**          | audio  | 18.75  | .467    | 1.76       | .938     | .734     | .833    | .623      | [`mtg-upf/omar-rq-multifeature`](https://huggingface.co/mtg-upf/omar-rq-multifeature)   |
| **multifeature-25hz**     | audio  | 25     | .463    | 1.79       | .932     | .728     | .848    | .628      | [`mtg-upf/omar-rq-multifeature-25hz`](https://huggingface.co/mtg-upf/omar-rq-multifeature-25hz) |
| **multifeature-25hz-fsq** | audio  | 25     | .463    | 1.71       | **.940** | **.749** | **.855**| .628      | [`mtg-upf/omar-rq-multifeature-25hz-fsq`](https://huggingface.co/mtg-upf/omar-rq-multifeature-25hz-fsq) |
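
As a rule of thumb, the best checkpoint per task (bold in the table above) can be captured in a small lookup. A hedged sketch; the mapping is read off the table and is illustrative, not an official recommendation:

```python
from omar_rq import get_model

# Best-performing checkpoint per task, per the table above
TASK_TO_MODEL_ID = {
    "tagging": "mtg-upf/omar-rq-multicodebook",
    "difficulty": "mtg-upf/omar-rq-base",
    "pitch": "mtg-upf/omar-rq-multifeature-25hz-fsq",
    "chord": "mtg-upf/omar-rq-multifeature-25hz-fsq",
    "beat": "mtg-upf/omar-rq-multifeature-25hz-fsq",
    "structure": "mtg-upf/omar-rq-base",
}

model = get_model(model_id=TASK_TO_MODEL_ID["chord"], device="cpu")
```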

## Licensing Information

The code in the [GitHub repository](https://github.com/MTG/omar-rq) is available under the [AGPL-3.0 license](https://www.gnu.org/licenses/agpl-3.0.en.html). The model weights are available under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license for non-commercial applications.

## Citation

If you find this work useful, please cite the paper:

```bibtex
@article{alonso2025omarrq,
  title={OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction},
  author={Alonso-Jim\'enez, Pablo and Ramoneda, Pedro and Araz, R. Oguz and Poltronieri, Andrea and Bogdanov, Dmitry},
  journal={arXiv preprint arXiv:2507.03482},
  year={2025}
}
```