# jina-reranker-m0-gguf

`jina-reranker-m0` is a cutting-edge multimodal, multilingual reranker for text, code, image, and visual-document reranking. Check out its features and benchmarks here. This repo covers how to use its GGUFs and how they're built.
## Usage
We offer `jinaai/jina-reranker-m0-GGUF` on HuggingFace with various quantization levels: 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, and 16-bit. Dynamic quantization (like Unsloth's) is coming soon.
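If you want to grab the scoring-MLP weights used in step 3 below ahead of time, here is a minimal sketch with `huggingface_hub`, assuming `mlp_weights.npz` sits at the repo root:

```python
from huggingface_hub import hf_hub_download

# Assumption: mlp_weights.npz (see "How GGUF Was Built") lives at the root
# of the GGUF repo. The GGUF itself is fetched by llama-embedding's -hf flag.
mlp_path = hf_hub_download("jinaai/jina-reranker-m0-GGUF", "mlp_weights.npz")
```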
To use these GGUFs, follow these three steps:
1. Construct your `(QUERY, DOCUMENT)` pair as a prompt:

   ```
   **Document**:\n{DOCUMENT}\n**Query**:\n{QUERY}<|box_end|>
   ```

   Refer to `test.txt` for the correct batch construction (a sketch for building such a file follows these steps). Note how `\n` is NOT the instance separator, but `<|box_end|><#sep#>` is.

2. Get the last embedding for this prompt using `llama.cpp`'s `llama-embedding`. You can change `F16` to other quantizations:

   - Get a single embedding:

     ```bash
     llama-embedding -hf jinaai/jina-reranker-m0-GGUF:F16 \
       --pooling last --embd-normalize -1 --embd-separator "<#sep#>" --embd-output-format json \
       -p "**Document**:\nWe present ReaderLM-v2\n**Query**:\nslm markdown<|box_end|>"
     ```

   - Get a batch from a file:

     ```bash
     llama-embedding -hf jinaai/jina-reranker-m0-GGUF:F16 \
       --pooling last --embd-normalize -1 --embd-separator "<#sep#>" --embd-output-format json \
       -f test.txt \
       2>/dev/null >out.json
     ```

   Due to jina-reranker-m0's design, you must use `--pooling last --embd-normalize -1`. Also, add `--embd-separator "<#sep#>"`: `llama-embedding` defaults to `\n` as the separator, which breaks multiline docs/queries, so swap it for `<#sep#>` or something similar.
3. Feed the `last_embeddings` into a predefined MLP to get the relevance scores:

   ```python
   import json
   import numpy as np

   # reconstruct MLP
   with np.load('mlp_weights.npz') as data:
       W1, b1, W2, b2 = data['W1'], data['b1'], data['W2'], data['b2']
       logit_bias = float(data['logit_bias'][0])
   mlp = lambda x: 1 / (1 + np.exp(-((np.maximum(0, x @ W1 + b1) @ W2 + b2) - logit_bias)))

   # get embeddings from file
   with open('out.json') as f:
       data = json.load(f)
   embeddings = np.array([item['embedding'] for item in data['data']])

   # get relevance scores
   rel_score = mlp(embeddings)
   ```
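As referenced in step 1, here is a minimal sketch for building a `test.txt`-style batch file; the example pairs are made up for illustration:

```python
# Build the batch file consumed by `llama-embedding -f test.txt`.
pairs = [
    ("slm markdown", "We present ReaderLM-v2"),
    ("multimodal reranking", "jina-reranker-m0 scores (query, document) pairs."),
]

def make_prompt(query: str, document: str) -> str:
    # Each instance ends with the <|box_end|> scoring token.
    return f"**Document**:\n{document}\n**Query**:\n{query}<|box_end|>"

with open("test.txt", "w") as f:
    # Join instances with "<#sep#>" to match --embd-separator; the default
    # "\n" separator would split multiline documents or queries apart.
    f.write("<#sep#>".join(make_prompt(q, d) for q, d in pairs))
```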
## How GGUF Was Built
`jina-reranker-m0` builds on `Qwen/Qwen2-VL-2B`, but two quirks make the GGUF conversion trickier.
First, the model uses `token_id=100` as a scoring token at the end of each (query, document) pair to trigger the "scoring state". This token was arbitrarily picked during m0's training, which complicates things for GGUF users familiar with string-level inputs, as it doesn't play nice with BPE tokenizers. Our fix: we swapped `100` with `<|box_end|>` (`151649`) in the tokenizer before building the GGUFs. So you'll need to append `<|box_end|>` to each `(QUERY, DOCUMENT)` pair like this:

```
**Document**:\n{DOCUMENT}\n**Query**:\n{QUERY}<|box_end|>
```
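You can double-check the scoring token's id against the base model's tokenizer; a quick sketch, assuming the `Qwen/Qwen2-VL-2B` tokenizer loads via `transformers`:

```python
from transformers import AutoTokenizer

# The GGUFs ship the already-patched tokenizer; this just verifies that
# <|box_end|> maps to id 151649 in the Qwen2-VL vocabulary.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B")
print(tok.convert_tokens_to_ids("<|box_end|>"))  # 151649
```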
Second, the scoring MLP isn't included in the GGUF because `llama.cpp` doesn't support it well. Instead, we dump the MLP into a separate `mlp_weights.npz` file. This MLP is a simple two-layer setup with ReLU activation, mapping the last hidden state of `<|box_end|>` from 1536 dimensions to a single score. To use it in Python, load and reconstruct the MLP like this:
```python
import numpy as np

with np.load('mlp_weights.npz') as data:
    W1, b1, W2, b2 = data['W1'], data['b1'], data['W2'], data['b2']
    logit_bias = float(data['logit_bias'][0])

mlp = lambda x: 1 / (1 + np.exp(-((np.maximum(0, x @ W1 + b1) @ W2 + b2) - logit_bias)))
```
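A quick sanity check with random vectors (illustrative only; real inputs are the last embeddings produced by `llama-embedding`):

```python
# Dummy batch of 1536-dim vectors; each output is a sigmoid-squashed
# relevance score in (0, 1).
x = np.random.randn(2, 1536).astype(np.float32)
print(mlp(x))
```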
This MLP is lightweight and can easily be moved to a GPU if needed.
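For instance, one way to port it to a GPU with PyTorch, assuming `torch` and a CUDA device are available (a sketch, not part of the repo):

```python
import torch

# Move the weights loaded above onto the GPU once, then reuse them.
device = "cuda"
tW1, tb1 = torch.tensor(W1, device=device), torch.tensor(b1, device=device)
tW2, tb2 = torch.tensor(W2, device=device), torch.tensor(b2, device=device)

def mlp_gpu(x: torch.Tensor) -> torch.Tensor:
    # Same two-layer ReLU MLP + sigmoid as the NumPy version above.
    h = torch.relu(x @ tW1 + tb1)
    return torch.sigmoid((h @ tW2 + tb2) - logit_bias)
```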
Final rerank scores from `jina-reranker-m0-GGUF` are calculated as `mlp(last_embeddings)`.