# jina-reranker-m0-gguf

`jina-reranker-m0` is a cutting-edge multimodal, multilingual reranker for text, code, image, and visual-document reranking. Check out its features and benchmarks here. This repo covers how to use its GGUFs and how they're built.
## Usage
We offer `jinaai/jina-reranker-m0-GGUF` on HuggingFace with various quantization levels: 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, and 16-bit. Dynamic quantization (like Unsloth's) is coming soon.
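If you want to grab the scoring-MLP weights used in step 3 below ahead of time, here is a minimal sketch with `huggingface_hub`, assuming `mlp_weights.npz` sits at the repo root:

```python
from huggingface_hub import hf_hub_download

# Assumption: mlp_weights.npz (see "How GGUF Was Built") lives at the root
# of the GGUF repo. The GGUF itself is fetched by llama-embedding's -hf flag.
mlp_path = hf_hub_download("jinaai/jina-reranker-m0-GGUF", "mlp_weights.npz")
```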
To use these GGUFs, follow these three steps:
1. Construct your `(QUERY, DOCUMENT)` pair as a prompt:

   ```
   **Document**:\n{DOCUMENT}\n**Query**:\n{QUERY}<|box_end|>
   ```

   Refer to `test.txt` for the correct batch construction (a sketch for building such a file follows these steps). Note how `\n` is NOT the instance separator, but `<|box_end|><#sep#>` is.

2. Get the last embedding for this prompt using `llama.cpp`'s `llama-embedding`. You can change `F16` to other quantizations:

   - Get a single embedding:

     ```bash
     llama-embedding -hf jinaai/jina-reranker-m0-GGUF:F16 \
       --pooling last --embd-normalize -1 --embd-separator "<#sep#>" --embd-output-format json \
       -p "**Document**:\nWe present ReaderLM-v2\n**Query**:\nslm markdown<|box_end|>"
     ```

   - Get a batch from a file:

     ```bash
     llama-embedding -hf jinaai/jina-reranker-m0-GGUF:F16 \
       --pooling last --embd-normalize -1 --embd-separator "<#sep#>" --embd-output-format json \
       -f test.txt \
       2>/dev/null >out.json
     ```

   Due to jina-reranker-m0's design, you must use `--pooling last --embd-normalize -1`. Also, add `--embd-separator "<#sep#>"`: `llama-embedding` defaults to `\n` as the separator, which breaks multiline docs/queries, so swap it for `<#sep#>` or something similar.
3. Feed the `last_embeddings` into a predefined MLP to get the relevance scores:

   ```python
   import json
   import numpy as np

   # reconstruct MLP
   with np.load('mlp_weights.npz') as data:
       W1, b1, W2, b2 = data['W1'], data['b1'], data['W2'], data['b2']
       logit_bias = float(data['logit_bias'][0])
   mlp = lambda x: 1 / (1 + np.exp(-((np.maximum(0, x @ W1 + b1) @ W2 + b2) - logit_bias)))

   # get embeddings from file
   with open('out.json') as f:
       data = json.load(f)
   embeddings = np.array([item['embedding'] for item in data['data']])

   # get relevance scores
   rel_score = mlp(embeddings)
   ```
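As referenced in step 1, here is a minimal sketch for building a `test.txt`-style batch file; the example pairs are made up for illustration:

```python
# Build the batch file consumed by `llama-embedding -f test.txt`.
pairs = [
    ("slm markdown", "We present ReaderLM-v2"),
    ("multimodal reranking", "jina-reranker-m0 scores (query, document) pairs."),
]

def make_prompt(query: str, document: str) -> str:
    # Each instance ends with the <|box_end|> scoring token.
    return f"**Document**:\n{document}\n**Query**:\n{query}<|box_end|>"

with open("test.txt", "w") as f:
    # Join instances with "<#sep#>" to match --embd-separator; the default
    # "\n" separator would split multiline documents or queries apart.
    f.write("<#sep#>".join(make_prompt(q, d) for q, d in pairs))
```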
## How GGUF Was Built
`jina-reranker-m0` builds on `Qwen/Qwen2-VL-2B`, but two quirks make the GGUF conversion trickier.
First, the model uses `token_id=100` as a scoring token at the end of each (query, document) pair to trigger the "scoring state". This token was arbitrarily picked during m0's training, which complicates things for GGUF users familiar with string-level inputs, as it doesn't play nice with BPE tokenizers. Our fix: we swapped `100` with `<|box_end|>` (`151649`) in the tokenizer before building the GGUFs. So you'll need to append `<|box_end|>` to each `(QUERY, DOCUMENT)` pair like this:

```
**Document**:\n{DOCUMENT}\n**Query**:\n{QUERY}<|box_end|>
```
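You can double-check the scoring token's id against the base model's tokenizer; a quick sketch, assuming the `Qwen/Qwen2-VL-2B` tokenizer loads via `transformers`:

```python
from transformers import AutoTokenizer

# The GGUFs ship the already-patched tokenizer; this just verifies that
# <|box_end|> maps to id 151649 in the Qwen2-VL vocabulary.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B")
print(tok.convert_tokens_to_ids("<|box_end|>"))  # 151649
```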
Second, the scoring MLP isn't included in the GGUF because `llama.cpp` doesn't support it well. Instead, we dump the MLP into a separate `mlp_weights.npz` file. This MLP is a simple two-layer setup with ReLU activation, mapping the last hidden state of `<|box_end|>` from 1536 dimensions to a single score. To use it in Python, load and reconstruct the MLP like this:
```python
import numpy as np

with np.load('mlp_weights.npz') as data:
    W1, b1, W2, b2 = data['W1'], data['b1'], data['W2'], data['b2']
    logit_bias = float(data['logit_bias'][0])

mlp = lambda x: 1 / (1 + np.exp(-((np.maximum(0, x @ W1 + b1) @ W2 + b2) - logit_bias)))
```
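A quick sanity check with random vectors (illustrative only; real inputs are the last embeddings produced by `llama-embedding`):

```python
# Dummy batch of 1536-dim vectors; each output is a sigmoid-squashed
# relevance score in (0, 1).
x = np.random.randn(2, 1536).astype(np.float32)
print(mlp(x))
```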
This MLP is lightweight and can easily be moved to a GPU if needed.
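For instance, one way to port it to a GPU with PyTorch, assuming `torch` and a CUDA device are available (a sketch, not part of the repo):

```python
import torch

# Move the weights loaded above onto the GPU once, then reuse them.
device = "cuda"
tW1, tb1 = torch.tensor(W1, device=device), torch.tensor(b1, device=device)
tW2, tb2 = torch.tensor(W2, device=device), torch.tensor(b2, device=device)

def mlp_gpu(x: torch.Tensor) -> torch.Tensor:
    # Same two-layer ReLU MLP + sigmoid as the NumPy version above.
    h = torch.relu(x @ tW1 + tb1)
    return torch.sigmoid((h @ tW2 + tb2) - logit_bias)
```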
Final rerank scores from `jina-reranker-m0-GGUF` are calculated as `mlp(last_embeddings)`.