polish-reranker-roberta-v3

This model is designed to serve as a Polish reranker in retrieval-augmented generation (RAG) pipelines. It is a general-purpose reranker that delivers strong performance across various document types and domains. It is the successor to sdadas/polish-reranker-roberta-v2 and has the following key features:

The model is based on a new version of polish-roberta that supports long contexts of up to 8192 tokens.
It was trained using knowledge distillation, with an LLM-based reranker (BAAI/bge-reranker-v2.5-gemma2-lightweight) as the teacher model. Training used the same RankNet loss function and largely the same training corpus as the previous version. The only major change is the addition of a new dataset comprising approximately 40,000 passages from the public administration domain, sourced from Polish government and local municipality websites. Questions for these passages were generated synthetically using DeepSeek-V3-0324.
The model achieves significantly better results on long-context reranking tasks than the previous version. On tasks involving short texts, it maintains performance similar to sdadas/polish-reranker-roberta-v2.
This is an efficient model with only 443M parameters, yet it delivers performance competitive with large LLM-based rerankers. For example, on the PIRB benchmark, our model outperforms the multilingual state-of-the-art reranker Qwen/Qwen3-Reranker-8B in 18 out of 41 tasks while having only about 5% of its parameters.
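To make the distillation objective mentioned above more concrete, here is a minimal sketch of a pairwise RankNet loss in which the preferred document of each pair is decided by the teacher's scores. The function name `ranknet_distillation_loss`, the handling of ties, and the averaging over pairs are assumptions made for illustration; the exact formulation used to train this model is not specified here.

```python
import math
from itertools import combinations


def ranknet_distillation_loss(student_scores, teacher_scores):
    """Mean pairwise RankNet loss with pair orderings taken from teacher scores.

    Assumption: a pair contributes only when the teacher scores differ, and the
    document the teacher scores higher is treated as the preferred one.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")

    losses = []
    for i, j in combinations(range(len(teacher_scores)), 2):
        if teacher_scores[i] == teacher_scores[j]:
            continue  # the teacher expresses no preference for this pair
        # Orient the pair so that `hi` is the teacher-preferred document.
        hi, lo = (i, j) if teacher_scores[i] > teacher_scores[j] else (j, i)
        margin = student_scores[hi] - student_scores[lo]
        # log(1 + exp(-margin)), computed in a numerically stable way.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical scores for three passages retrieved for one query.
teacher = [0.9, 0.5, 0.1]
print(ranknet_distillation_loss([3.0, 2.0, 1.0], teacher))  # student agrees with teacher
print(ranknet_distillation_loss([1.0, 2.0, 3.0], teacher))  # student reverses the order
```

The loss is small when the student orders passages the same way as the teacher and grows as the student's ordering diverges, which is what lets a compact model imitate the ranking behavior of a much larger one.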

Long context reranking

The main difference between polish-reranker-roberta-v2 and polish-reranker-roberta-v3 is support for longer contexts: a 16x increase from 512 to 8192 tokens. This has a significant impact on reranking quality for datasets with longer documents. Table 1 presents examples of tasks from the PIRB benchmark where the extended context led to substantial improvements. For short and medium-length texts, the model achieves comparable or slightly better results than the previous version. For tasks involving very short texts (e.g., onet, czy-wiesz-v2), we observed a slight decrease in quality. The average score on the PIRB benchmark increased from 64.49 to 65.17 when using the sdadas/mmlw-retrieval-roberta-large retriever, and from 65.30 to 66.21 when using BAAI/bge-multilingual-gemma2.

Task	Improvement using sdadas/mmlw-retrieval-roberta-large retriever	Improvement using BAAI/bge-multilingual-gemma2 retriever
eprawnik	70.70 ⟶ 76.43 (+5.73)	73.24 ⟶ 82.36 (+9.12)
abczdrowie	53.50 ⟶ 58.20 (+4.70)	55.12 ⟶ 61.37 (+6.25)
specprawnik	43.84 ⟶ 46.06 (+2.22)	52.00 ⟶ 55.74 (+3.74)
zapytajfizyka	96.10 ⟶ 97.44 (+1.34)	96.09 ⟶ 97.48 (+1.39)
arguana	63.12 ⟶ 67.42 (+4.30)	63.78 ⟶ 67.53 (+3.75)
quora	66.61 ⟶ 72.26 (+5.65)	66.30 ⟶ 74.55 (+8.25)

Table 1. Comparison between sdadas/polish-reranker-roberta-v2 and sdadas/polish-reranker-roberta-v3 on selected long-context tasks from the PIRB benchmark. We report the absolute improvement in NDCG@10.
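For readers unfamiliar with the metric reported in the tables, the following is a minimal sketch of NDCG@10 for a single query. It uses the linear-gain variant and assumes that the ranked list contains every judged document, so the ideal ordering can be derived from the same list. The function names are illustrative, and the benchmark's own evaluation code may differ in details.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    `ranked_relevances` holds the relevance label of each document in the order
    produced by the reranker. Assumption: all judged documents are in this list.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical labels: the only relevant document is ranked first, then second.
print(ndcg_at_k([1, 0, 0]))
print(ndcg_at_k([0, 1, 0]))
```

A benchmark score is then the mean of this value over all queries of a task, usually multiplied by 100.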

Public administration use case

One of the aspects we focused on while building the new model was improving reranking quality for datasets from the government and municipal administration domain. To address this, the training data was expanded with an additional corpus of questions and documents related to public administration. Table 2 presents the evaluation results of selected rerankers on three datasets from this domain. Two of them consist of short questions and answers prepared manually (ezd-qa) or scraped from government and local FAQ websites (opi-urzedowe). The third dataset (ezd-ir-chunked) contains short questions and long passages of up to several thousand tokens, thus requiring the handling of longer context and corresponding to a typical RAG use case. The results for all rerankers were generated using BAAI/bge-m3 as the retriever.

Model	Parameters	Context	opi-urzedowe	ezd-qa	ezd-ir-chunked
sdadas/polish-reranker-large-ranknet	435M	512	85.0	78.2	79.9
sdadas/polish-reranker-roberta-v2	435M	512	86.0	78.2	83.0
Qwen/Qwen3-Reranker-4B	4B	32768	77.4	72.4	80.6
Qwen/Qwen3-Reranker-8B	8B	32768	84.1	79.9	84.8
BAAI/bge-reranker-v2-m3	568M	8192	78.8	74.3	85.4
sdadas/polish-reranker-roberta-v3	443M	8192	86.3	80.1	86.7

Table 2. NDCG@10 scores for selected rerankers on three tasks from the Polish public administration domain.

Usage (Hugging Face Transformers)

The model can be used with Hugging Face Transformers as follows:

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "How to live to be 100?"
query = "Jak dożyć 100 lat?"
answers = [
    # "You need to eat healthily and play sports."
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    # "You need to drink alcohol, party, and drive fast cars."
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    # "During the campaign, politicians promised to do away with the Sunday trading ban."
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-roberta-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

# Each input is the query and one candidate answer joined by "</s></s>".
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=8192, truncation=True, return_tensors="pt").to("cuda")

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    output = model(**tokens)

# One relevance logit per (query, answer) pair.
results = output.logits.detach().cpu().float().numpy()
results = np.squeeze(results)
print(results.tolist())
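The script above prints one score per candidate, in input order. In a RAG pipeline the usual next step is to sort the candidates by score and keep the best few. Below is a minimal sketch of that step, assuming that a higher score means a more relevant candidate; `rank_by_score` and the example scores are hypothetical and are not part of the model's API or output.

```python
def rank_by_score(candidates, scores, top_k=None):
    """Return (candidate, score) pairs sorted from highest to lowest score."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical scores, for illustration only (not real model output).
example_answers = ["answer A", "answer B", "answer C"]
example_scores = [0.4, 2.1, -3.0]
for text, score in rank_by_score(example_answers, example_scores, top_k=2):
    print(f"{score:+.2f}  {text}")
```

With the variables from the script above, the call would be `rank_by_score(answers, results.tolist())`.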

Usage (Sentence-Transformers)

The model can also be used in Sentence-Transformers:

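import torch
from sentence_transformers import CrossEncoder

# "How to live to be 100?"
query = "Jak dożyć 100 lat?"
answers = [
    # "You need to eat healthily and play sports."
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    # "You need to drink alcohol, party, and drive fast cars."
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    # "During the campaign, politicians promised to do away with the Sunday trading ban."
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# The Identity activation leaves the scores as raw logits.
model = CrossEncoder(
    "sdadas/polish-reranker-roberta-v3",
    default_activation_function=torch.nn.Identity(),
    max_length=8192,
    device="cuda",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16, "attn_implementation": "flash_attention_2"}
)
# predict() takes [query, answer] pairs and returns one score per pair.
results = model.predict([[query, answer] for answer in answers])
print(results.tolist())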

Evaluation Results

The model achieves NDCG@10 of 66.21 in the Rerankers category of the Polish Information Retrieval Benchmark. See PIRB Leaderboard for detailed results.

Citation

@inproceedings{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish},
  author={Dadas, S{\l}awomir and Gr{\k{e}}bowiec, Ma{\l}gorzata},
  booktitle={International Conference on Artificial Intelligence and Soft Computing},
  pages={37--49},
  year={2024},
  organization={Springer}
}
