Create README.md
README.md ADDED
@@ -0,0 +1,146 @@
---
pipeline_tag: audio-classification
library_name: omar_rq
license: cc-by-nc-sa-4.0
tags:
  - audio
  - music
  - self-supervised-learning
  - audio-representation
  - music-tagging
  - pitch-estimation
  - chord-recognition
  - beat-tracking
  - segmentation
  - difficulty-estimation
---

# OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction

**OMAR-RQ** is an open-source foundation model for music audio understanding, presented in the paper [OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction](https://huggingface.co/papers/2507.03482).

OMAR-RQ is trained with self-supervision via masked token classification on a large-scale dataset of over 330,000 hours of music audio. It provides general-purpose representations for music information retrieval research. The model achieves state-of-the-art performance among open self-supervised models on the following tasks:
*   **Music Tagging**
*   **Pitch Estimation**
*   **Chord Recognition**
*   **Beat Tracking**
*   **Segmentation**
*   **Difficulty Estimation**

For the full training, validation, and inference code, please refer to the [official GitHub repository](https://github.com/MTG/omar-rq).

## Installation

For embedding extraction or fine-tuning, install the package from a local clone of the repository:

```bash
pip install .
```

For development, including pre-training your own models:

```bash
pip install -e ".[train]"
```

## Inference

You can load an OMAR-RQ model by specifying its Hugging Face model ID:

```python
import torch
from omar_rq import get_model

# Embedding extraction example
x = torch.randn(1, 16000 * 4)  # Example: 4 seconds of mono audio at 16 kHz, shape (B, T')

# Load a specific model, e.g., "mtg-upf/omar-rq-multifeature-25hz-fsq"
model_id = "mtg-upf/omar-rq-multifeature-25hz-fsq"
model = get_model(model_id=model_id, device="cpu")  # Use "cuda" if a GPU is available

# Extract embeddings from layer 6; the output has shape (L, B, T, C)
embeddings = model.extract_embeddings(x, layers=[6])

# Use the `model.eps` field to compute timestamps for the extracted embeddings
timestamps = torch.arange(embeddings.shape[2]) / model.eps

print(f"Extracted embeddings shape: {embeddings.shape}")
print(f"First 5 timestamps: {timestamps[:5]}")
```

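The timestamp computation in the example above can also be inverted to align annotations given in seconds (for instance beat or chord labels) with embedding frames. The following is a minimal sketch and not part of the `omar_rq` package: `time_to_frame` and `frame_span` are hypothetical helpers, and the sketch assumes that frame `i` starts at `i / eps` seconds, as in the example. The values `eps = 25.0` and `num_frames = 100` are placeholders; in practice they would come from `model.eps` and `embeddings.shape[2]`.

```python
import math


def time_to_frame(t_seconds: float, eps: float) -> int:
    """Index of the last frame whose start time (i / eps) is <= t_seconds."""
    if t_seconds < 0 or eps <= 0:
        raise ValueError("t_seconds must be >= 0 and eps must be > 0")
    # The small tolerance guards against floating-point error in t_seconds * eps.
    return math.floor(t_seconds * eps + 1e-9)


def frame_span(start_s: float, end_s: float, eps: float, num_frames: int) -> range:
    """Frames whose start time falls in [start_s, end_s), clipped to num_frames."""
    first = math.ceil(start_s * eps - 1e-9)
    last = math.ceil(end_s * eps - 1e-9)
    return range(min(first, num_frames), min(last, num_frames))


eps = 25.0        # assumed placeholder; read the real value from model.eps
num_frames = 100  # assumed placeholder; read the real value from embeddings.shape[2]

# Frames covered by an annotation that spans seconds 1.0 to 2.0.
span = frame_span(1.0, 2.0, eps, num_frames)
print(span.start, span.stop)
```

With a real model, a span like this could be used to slice the time axis of the embeddings, for example `embeddings[0, 0, span.start:span.stop]`.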
**`get_model` reference:**

```
Returns an OMAR-RQ Module from the provided model_id or config_file.

Args:
    model_id (str): Hugging Face's Model ID or local path to the model
    config_file (Path): Path to the model config of a trained model.
    device (str): Device to use for the model. Defaults to "cpu".
    quantization_targets (bool): If True, it will create the quantization
        targets for SSL pre-training of the model. Defaults to False.

Output:
    module: The model from the provided config file.


Module usage:

Args:
    audio (torch.Tensor): 2D mono audio tensor (B, T'). Where B is
        the batch size and T' is the number of samples.
    layers (set): Set of layer indices to extract embeddings from.
        By default, it extracts embeddings from the last layer (logits).

Output:
    torch.Tensor: Extracted embeddings. The output tensor has shape
        (L, B, T, C) where L = len(layers), B is the batch size, T is
        the number of output timestamps, and C = embedding dimension.
```

**`extract_embeddings` reference:**

```
Extract embeddings from an input audio batch.

Args:
    audio (torch.Tensor): 2D mono audio tensor (B, T'). Where B is
        the batch size and T' is the number of samples.
    layers (set): Set of layer indices to extract embeddings from.
        By default, it extracts embeddings from the last layer (logits).

Output:
    torch.Tensor: Extracted embeddings. The output tensor has shape
        (L, B, T, C) where L = len(layers), B is the batch size, T is
        the number of output timestamps, and C = embedding dimension.
```

## Available Models

OMAR-RQ models are offered in different configurations, each with its own strengths and weaknesses. Models that take mel spectrograms as input (**base** and **multicodebook**) tend to perform better on semantic tasks such as auto-tagging, structure recognition, and difficulty estimation. On the other hand, **multifeature-25hz-fsq** offers the best performance in tonal and temporal tasks such as pitch and chord estimation, and beat tracking. In the table below, bold marks the best result in each column; difficulty is reported as MSE, so lower is better.

| Model                     | Input  | Rate (Hz) | Tagging (_mAP_) | Difficulty (_MSE_) | Pitch (_acc._) | Chord (_acc._) | Beat (_F1_) | Structure (_acc._) | Hugging Face Model ID |
|:--------------------------|:-------|:----------|:----------------|:-------------------|:---------------|:---------------|:------------|:-------------------|:----------------------|
| **base**                  | mel    | 15.63     | .482            | **1.65**           | .892           | .657           | .783        | **.647**           | [`mtg-upf/omar-rq-base`](https://huggingface.co/mtg-upf/omar-rq-base) |
| **multicodebook**         | mel    | 15.63     | **.488**        | 1.66               | .897           | .675           | .775        | .639               | [`mtg-upf/omar-rq-multicodebook`](https://huggingface.co/mtg-upf/omar-rq-multicodebook) |
| **multifeature**          | audio  | 18.75     | .467            | 1.76               | .938           | .734           | .833        | .623               | [`mtg-upf/omar-rq-multifeature`](https://huggingface.co/mtg-upf/omar-rq-multifeature) |
| **multifeature-25hz**     | audio  | 25        | .463            | 1.79               | .932           | .728           | .848        | .628               | [`mtg-upf/omar-rq-multifeature-25hz`](https://huggingface.co/mtg-upf/omar-rq-multifeature-25hz) |
| **multifeature-25hz-fsq** | audio  | 25        | .463            | 1.71               | **.940**       | **.749**       | **.855**    | .628               | [`mtg-upf/omar-rq-multifeature-25hz-fsq`](https://huggingface.co/mtg-upf/omar-rq-multifeature-25hz-fsq) |

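Choosing a checkpoint from the table above can be expressed as a small lookup. This is an illustrative sketch only: `SCORES`, `TASKS`, and `best_model_for` are hypothetical names that are not part of `omar_rq`, and the numbers are copied by hand from the table. It assumes that a lower value is better for difficulty (MSE) and a higher value is better for every other column.

```python
# Column order: tagging (mAP), difficulty (MSE), pitch (acc.),
# chord (acc.), beat (F1), structure (acc.). Values copied from the table.
TASKS = ("tagging", "difficulty", "pitch", "chord", "beat", "structure")

SCORES = {
    "mtg-upf/omar-rq-base": (0.482, 1.65, 0.892, 0.657, 0.783, 0.647),
    "mtg-upf/omar-rq-multicodebook": (0.488, 1.66, 0.897, 0.675, 0.775, 0.639),
    "mtg-upf/omar-rq-multifeature": (0.467, 1.76, 0.938, 0.734, 0.833, 0.623),
    "mtg-upf/omar-rq-multifeature-25hz": (0.463, 1.79, 0.932, 0.728, 0.848, 0.628),
    "mtg-upf/omar-rq-multifeature-25hz-fsq": (0.463, 1.71, 0.940, 0.749, 0.855, 0.628),
}

# Difficulty is an error measure, so the smallest value wins.
LOWER_IS_BETTER = {"difficulty"}


def best_model_for(task: str) -> str:
    """Return the model ID with the best reported score for the given task."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    column = TASKS.index(task)
    pick = min if task in LOWER_IS_BETTER else max
    return pick(SCORES, key=lambda model_id: SCORES[model_id][column])


print(best_model_for("chord"))
print(best_model_for("difficulty"))
```

The returned string can be passed as `model_id` to `get_model`. The reported scores come from the paper's benchmarks, so they are a starting point rather than a guarantee for other datasets.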
## Licensing Information

The code in the [GitHub repository](https://github.com/MTG/omar-rq) is available under the [AGPL-3.0 license](https://www.gnu.org/licenses/agpl-3.0.en.html). The model weights are available under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license for non-commercial applications.

## Citation

If you find this work useful, please cite the paper:

```bibtex
@article{alonso2025omarrq,
  title={OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction},
  author={Alonso-Jim\'enez, Pablo and Ramoneda, Pedro and Araz, R. Oguz and Poltronieri, Andrea and Bogdanov, Dmitry},
  journal={arXiv preprint arXiv:2507.03482},
  year={2025}
}
```