jina-reranker-m0-gguf

jina-reranker-m0 is a cutting-edge multimodal, multilingual reranker for text, code, image, and visual-document reranking. See the jina-reranker-m0 model card for its features and benchmarks. This repo covers how to use its GGUFs and how they were built.

Usage

We offer jinaai/jina-reranker-m0-GGUF at various quantization levels on Hugging Face. Dynamic quantization (like Unsloth) is coming soon.

To use these GGUFs, follow these three steps:

  1. Construct your (QUERY, DOCUMENT) pair as a prompt:

    **Document**:\n{DOCUMENT}\n**Query**:\n{QUERY}<|box_end|>
    

Refer to test.txt for the correct batch construction. Note that \n is NOT the instance separator; <|box_end|><#sep#> is.
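
    For example, here is a minimal Python sketch for building such a batch file (the pairs are made-up placeholders; cross-check the exact formatting against test.txt):

      # join (query, document) prompts with <#sep#>; newlines inside a prompt are fine
      pairs = [
          ("slm markdown", "We present ReaderLM-v2"),
          ("vector search", "Embedding models map text into vectors"),
      ]
      prompts = [f"**Document**:\n{doc}\n**Query**:\n{query}<|box_end|>" for query, doc in pairs]
      with open("test.txt", "w") as f:
          f.write("<#sep#>".join(prompts))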

  2. Get the last-token embedding for each prompt using llama.cpp's llama-embedding tool. You can swap F16 for any other quantization level:

    • Get a single embedding:
      llama-embedding -hf jinaai/jina-reranker-m0-GGUF:F16 \
                      --pooling last --embd-normalize -1 --embd-separator "<#sep#>" --embd-output-format json \
                      -p "**Document**:\nWe present ReaderLM-v2\n**Query**:\nslm markdown<|box_end|>"
      
    • Get a batch from a file:
      llama-embedding -hf jinaai/jina-reranker-m0-GGUF:F16 \
                      --pooling last --embd-normalize -1 --embd-separator "<#sep#>" --embd-output-format json \
                      -f test.txt \
                      2>/dev/null >out.json
      

    Due to jina-reranker-m0's design, you must use --pooling last --embd-normalize -1. Also add --embd-separator "<#sep#>": llama-embedding defaults to \n as the separator, which breaks multiline documents and queries, so swap it for <#sep#> or something similar.
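
    For reference, the JSON written by --embd-output-format json follows an OpenAI-style layout; roughly (other fields omitted), this is what the snippet in step 3 parses:

      {
        "data": [
          {"embedding": [0.0123, -0.0456, ...]},
          ...
        ]
      }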

  3. Feed last_embeddings into the predefined MLP to get the relevance scores.

    import json
    import numpy as np

    # reconstruct the scoring MLP from the bundled weights
    with np.load('mlp_weights.npz') as data:
        W1, b1, W2, b2 = data['W1'], data['b1'], data['W2'], data['b2']
        logit_bias = float(data['logit_bias'][0])

    def mlp(x):
        # two-layer MLP: ReLU hidden layer, then sigmoid over the bias-shifted logit
        return 1 / (1 + np.exp(-((np.maximum(0, x @ W1 + b1) @ W2 + b2) - logit_bias)))

    # load the embeddings produced by llama-embedding in step 2
    with open('out.json') as f:
        data = json.load(f)
    last_embeddings = np.array([item['embedding'] for item in data['data']])

    # one relevance score per (query, document) pair
    rel_scores = mlp(last_embeddings)
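
    Once you have the scores, reranking is just a descending sort. A minimal sketch continuing from the snippet above:

    scores = rel_scores.ravel()
    order = np.argsort(-scores)  # indices sorted by descending relevance
    for rank, idx in enumerate(order, start=1):
        print(f"{rank}. pair {idx}: score={scores[idx]:.4f}")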
    

How GGUF Was Built

jina-reranker-m0 builds on Qwen/Qwen2-VL-2B, but two quirks make the GGUF conversion trickier:

First, the model uses token_id=100 as a scoring token, appended to the end of each (query, document) pair to trigger the "scoring state". This token was arbitrarily picked during m0's training, which complicates things for GGUF users working with string-level inputs, since it doesn't play nicely with BPE tokenizers. Our fix: before building the GGUFs, we remapped the scoring token in the tokenizer from 100 to <|box_end|> (token_id=151649). So you'll need to append <|box_end|> to each (QUERY, DOCUMENT) pair like this:

**Document**:\n{DOCUMENT}\n**Query**:\n{QUERY}<|box_end|>
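
To sanity-check the swap on your side, you can confirm that <|box_end|> encodes to the single token id 151649 (a quick check assuming the stock Qwen2-VL tokenizer from transformers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B")
# <|box_end|> is a special token and should map to exactly one id
print(tok.encode("<|box_end|>", add_special_tokens=False))  # expected: [151649]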

Second, the scoring MLP isn't included in the GGUF because llama.cpp doesn't support it well. Instead, we dump it into a separate mlp_weights.npz file. It's a simple two-layer MLP with ReLU activation that maps the 1536-dimensional last hidden state of <|box_end|> to a single score. To use it in Python, load and reconstruct it like this:

import numpy as np

# load the MLP weights shipped alongside the GGUFs
with np.load('mlp_weights.npz') as data:
    W1, b1, W2, b2 = data['W1'], data['b1'], data['W2'], data['b2']
    logit_bias = float(data['logit_bias'][0])

def mlp(x):
    # ReLU hidden layer, then sigmoid over the bias-shifted logit
    return 1 / (1 + np.exp(-((np.maximum(0, x @ W1 + b1) @ W2 + b2) - logit_bias)))

This MLP is lightweight and can easily be moved to a GPU if needed.
Final rerank scores from jina-reranker-m0-GGUF are calculated as mlp(last_embeddings).
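
For example, a PyTorch port takes a few lines (a sketch, assuming the same mlp_weights.npz layout as above):

import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

with np.load('mlp_weights.npz') as data:
    W1 = torch.from_numpy(data['W1']).to(device)
    b1 = torch.from_numpy(data['b1']).to(device)
    W2 = torch.from_numpy(data['W2']).to(device)
    b2 = torch.from_numpy(data['b2']).to(device)
    logit_bias = float(data['logit_bias'][0])

def mlp_gpu(x):
    # same two-layer MLP as the NumPy version, evaluated on the GPU
    x = torch.as_tensor(np.asarray(x), dtype=W1.dtype, device=device)
    return torch.sigmoid(torch.relu(x @ W1 + b1) @ W2 + b2 - logit_bias)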
