pipeline_tag: audio-classification
library_name: omar_rq
license: cc-by-nc-sa-4.0
tags:
- audio-feature-extraction
- music
OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction
This repository contains the model weights for OMAR-RQ, as presented in the paper OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction.
Abstract
Developing open-source foundation models is essential for advancing research in music audio understanding and ensuring access to powerful, multipurpose representations for music information retrieval. We present OMAR-RQ, a model trained with self-supervision via masked token classification methodologies using a large-scale dataset with over 330,000 hours of music audio. We experiment with different input features and quantization options, and achieve state-of-the-art performance in music tagging, pitch estimation, chord recognition, beat tracking, segmentation, and difficulty estimation among open self-supervised models. We open-source our training and evaluation pipelines and model weights, available at https://github.com/MTG/OMAR-RQ.
Code
The training, validation, and inference code, along with further details, is available at the official GitHub repository: https://github.com/MTG/OMAR-RQ.
Inference
You can load an OMAR-RQ model directly by specifying its Hugging Face model ID. First, install the library:
pip install omar-rq
Then, use the following Python code for embedding extraction:
import torch
from omar_rq import get_model
# Embedding extraction example
x = torch.randn(1, 16000 * 4).cpu() # Example audio input (batch_size, samples)
model_id = "mtg-upf/omar-rq-multifeature-25hz-fsq" # This repository's model ID
model = get_model(model_id=model_id, device="cpu")
embeddings = model.extract_embeddings(x, layers=[6])
timestamps = torch.arange(embeddings.shape[2]) / model.eps
print(f"Extracted embeddings shape: {embeddings.shape}")
print(f"Number of timestamps: {len(timestamps)}")
For more details on get_model and extract_embeddings usage, please refer to the GitHub repository.
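The timestamp line in the inference example divides each frame index by the model's embedding rate (model.eps). The following pure-Python sketch shows the same arithmetic without loading a model. The helper frame_timestamps is hypothetical and not part of omar_rq, and the 25 Hz rate is taken from the table entry for this model.

```python
def frame_timestamps(num_frames: int, frames_per_second: float) -> list[float]:
    """Return the time offset in seconds of each embedding frame.

    Hypothetical helper for illustration; it mirrors
    torch.arange(num_frames) / model.eps from the example above.
    """
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return [i / frames_per_second for i in range(num_frames)]


# Assumed values: 4 frames at the 25 Hz rate listed for this model.
timestamps = frame_timestamps(4, 25.0)
print(timestamps)
```

At 25 Hz, consecutive frames are 40 ms apart, so a lower-rate model such as base (15.63 Hz) yields coarser time resolution for the same clip.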
Available Models
OMAR-RQ models are offered in different configurations, each with its own strengths and weaknesses. Models based on mel spectrogram (base and multicodebook) tend to perform better on semantic tasks such as auto-tagging, structure recognition, and difficulty estimation. On the other hand, multifeature-25hz-fsq offers the best performance in tonal and temporal tasks such as pitch and chord estimation, and beat tracking.
Model | Hugging Face ID | Input | Rate (Hz) | Tagging (mAP) | Difficulty (MSE) | Pitch (acc.) | Chord (acc.) | Beat (F1) | Structure (acc.) |
---|---|---|---|---|---|---|---|---|---|
base | mtg-upf/omar-rq-base | mel | 15.63 | .482 | 1.65 | .892 | .657 | .783 | .647 |
multicodebook | mtg-upf/omar-rq-multicodebook | mel | 15.63 | .488 | 1.66 | .897 | .675 | .775 | .639 |
multifeature | mtg-upf/omar-rq-multifeature | audio | 18.75 | .467 | 1.76 | .938 | .734 | .833 | .623 |
multifeature-25hz | mtg-upf/omar-rq-multifeature-25hz | audio | 25 | .463 | 1.79 | .932 | .728 | .848 | .628 |
multifeature-25hz-fsq | mtg-upf/omar-rq-multifeature-25hz-fsq | audio | 25 | .463 | 1.71 | .940 | .749 | .855 | .628 |
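As a rough guide to choosing a checkpoint, the sketch below copies the scores from the table and picks the best Hugging Face ID per task. The names RESULTS, LOWER_IS_BETTER, and best_model are hypothetical and exist only for this illustration. Difficulty is reported as MSE, so it is treated as lower-is-better.

```python
# Scores copied from the table above, keyed by Hugging Face ID.
RESULTS = {
    "mtg-upf/omar-rq-base": {
        "tagging": 0.482, "difficulty": 1.65, "pitch": 0.892,
        "chord": 0.657, "beat": 0.783, "structure": 0.647,
    },
    "mtg-upf/omar-rq-multicodebook": {
        "tagging": 0.488, "difficulty": 1.66, "pitch": 0.897,
        "chord": 0.675, "beat": 0.775, "structure": 0.639,
    },
    "mtg-upf/omar-rq-multifeature": {
        "tagging": 0.467, "difficulty": 1.76, "pitch": 0.938,
        "chord": 0.734, "beat": 0.833, "structure": 0.623,
    },
    "mtg-upf/omar-rq-multifeature-25hz": {
        "tagging": 0.463, "difficulty": 1.79, "pitch": 0.932,
        "chord": 0.728, "beat": 0.848, "structure": 0.628,
    },
    "mtg-upf/omar-rq-multifeature-25hz-fsq": {
        "tagging": 0.463, "difficulty": 1.71, "pitch": 0.940,
        "chord": 0.749, "beat": 0.855, "structure": 0.628,
    },
}

# Difficulty is an error metric (MSE); every other column is a score.
LOWER_IS_BETTER = {"difficulty"}


def best_model(task: str) -> str:
    """Return the Hugging Face ID with the best table score for a task."""
    scores = {model_id: metrics[task] for model_id, metrics in RESULTS.items()}
    pick = min if task in LOWER_IS_BETTER else max
    return pick(scores, key=scores.get)


for task in ("tagging", "difficulty", "pitch", "chord", "beat", "structure"):
    print(f"{task}: {best_model(task)}")
```

The returned ID can be passed as model_id to get_model, as in the inference example. These rankings reflect only the benchmark numbers listed here and may not carry over to other datasets.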
License
The code in the OMAR-RQ GitHub repository is available under the AGPL-3.0 license. The model weights on the Hugging Face Hub are released under the CC BY-NC-SA 4.0 license for non-commercial applications.
Citation
If you find this work helpful or inspiring, please feel free to cite it using the following BibTeX entry:
@article{fust,
title={OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction},
author={Fust, Albert and Pons, Jordi and Bogdanov, Dmitry and Oñoro-Rubio, Daniel and Gómez, Emilia},
journal={arXiv preprint arXiv:2507.03482},
year={2025}
}