polish-roberta-8k

A Polish language model built on the RoBERTa architecture, supporting a context length of up to 8192 tokens. Encoder-type models can be fine-tuned to solve various text prediction tasks such as classification, regression, sequence tagging, or retrieval. In such tasks, they are usually faster and more parameter-efficient than LLMs. This model was initialized from the polish-roberta-large-v2 checkpoint, which originally supported a 512-token context, and was then adapted to handle longer texts. The model training process was as follows:

  • In the first stage, the positional embedding layer was extended from 512 to 8192 tokens. The model was then trained on a corpus of approximately 150 billion tokens for one epoch. All other model weights were frozen; only the positional embedding layer was trained. The goal of this stage was to adapt the new layer without making drastic changes to the rest of the model's weights (see the sketch after this list).
  • In the second stage, all model weights were trained for 4 epochs. To improve training efficiency, we added support for Flash Attention 2 and contamination-free packing. Documents were packed into sequences of exactly 8192 tokens.
  • Since the model's ability to solve both short and long prediction tasks changed over the course of training, the final model is a merge of three checkpoints from different training stages (epochs 2, 3, and 4), which ensures balanced performance across different types of tasks.
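
The first stage can be sketched in a few lines of code. The snippet below is a minimal illustration, assuming a standard Hugging Face RoBERTa checkpoint; the model id and the tiling-based initialization of the new layer are our assumptions, not the exact training code:

```python
# Minimal sketch of stage 1: extend RoBERTa's positional embeddings from 512
# to 8192 positions and train only the new layer. The checkpoint id and the
# tiling initialization are assumptions, not the authors' exact procedure.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("sdadas/polish-roberta-large-v2")  # assumed id
embeddings = model.roberta.embeddings

old_emb = embeddings.position_embeddings  # 514 rows: 512 positions + 2-row offset
new_size = 8192 + 2                       # RoBERTa reserves two extra positions
new_emb = torch.nn.Embedding(new_size, old_emb.embedding_dim, padding_idx=old_emb.padding_idx)

with torch.no_grad():
    # Tile the original weights across the new table; one possible
    # initialization, the actual scheme is not specified in the card.
    for start in range(0, new_size, old_emb.num_embeddings):
        end = min(start + old_emb.num_embeddings, new_size)
        new_emb.weight[start:end] = old_emb.weight[: end - start]

embeddings.position_embeddings = new_emb
embeddings.register_buffer("position_ids", torch.arange(new_size).unsqueeze(0), persistent=False)
model.config.max_position_embeddings = new_size

# Freeze everything except the extended positional embedding layer.
for name, param in model.named_parameters():
    param.requires_grad = "position_embeddings" in name
```

After this stage, all weights are unfrozen for the main training runs; the final merge of the epoch 2, 3, and 4 checkpoints can then be performed, for example, by element-wise averaging of their weights.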

Evaluation - Fine-Tuning

The main advantage of the new model lies in its ability to process longer texts than existing Polish encoders. For short classification tasks, it maintains performance comparable to polish-roberta-large-v2. We analyzed the capabilities of our model by fine-tuning it on 25 different tasks in Polish. For comparison, we also included the two most popular Polish encoder models: herbert-large-cased and the original polish-roberta-large-v2. For each task and model, we performed five separate training runs with different random seeds, and the reported results are the averaged values. For tasks from the KLEJ benchmark, we used the hyperparameters from the original fine-tuning scripts for polish-roberta-large-v2. For the remaining tasks, we applied the same hyperparameters to every task: 10 epochs, a batch size of 32, a learning rate scheduler with a warmup phase covering 6% of the total iterations, a maximum learning rate of 1e-5, and polynomial decay. Note that these results may differ slightly from those on the KLEJ leaderboard, which includes fine-tuning results obtained with different frameworks and hyperparameter sets. Table 1 presents a summary of our results.

| TASK TYPE | DOMAIN | METRIC | GROUP | TASK | herbert-large | polish-roberta-large-v2 | polish-roberta-8k |
|---|---|---|---|---|---|---|---|
| single-label | mixed | accuracy | KLEJ | NKJP-NER | 96.07 | 95.75 | 95.64 |
| single-label | semantics | accuracy | KLEJ | CDSC-E | 94.78 | 94.16 | 94.28 |
| regression | semantics | spearman | KLEJ | CDSC-R | 95.01 | 95.25 | 95.33 |
| single-label | social media | binary-f1 | KLEJ | CBD | 70.21 | 73.10 | 73.23 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-IN | 91.39 | 93.55 | 93.05 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-OUT | 81.66 | 83.81 | 83.64 |
| single-label | mixed | binary-f1 | KLEJ | DYK | 73.31 | 74.87 | 74.05 |
| single-label | news | binary-f1 | KLEJ | PSC | 98.85 | 98.37 | 98.56 |
| regression | reviews | 1-wmae | KLEJ | AR | 89.23 | 89.36 | 88.91 |
| single-label | finance | accuracy | FinBench | banking-short | 81.80 | 81.69 | 81.99 |
| single-label | finance | accuracy | FinBench | banking-long | 86.64 | 87.89 | 88.35 |
| single-label | finance | accuracy | FinBench | banking77 | 92.76 | 92.45 | 92.74 |
| regression | finance | r2-score | FinBench | fiqa | 61.20 | 65.71 | 68.43 |
| single-label | finance | accuracy | FinBench | fpb | 84.99 | 85.26 | 85.42 |
| multi-label | finance | weighted-f1 | FinBench | gcn | 95.25 | 95.04 | 94.97 |
| single-label | finance | accuracy | FinBench | stooq | 82.53 | 85.07 | 84.41 |
| single-label | social media | accuracy | Other | 8TAGS | 81.16 | 81.64 | 81.44 |
| single-label | social media | accuracy | Other | BAN-PL | 93.25 | 93.80 | 93.99 |
| multi-label | news | weighted-f1 | Other | MIPD | 66.79 | 67.27 | 68.50 |
| single-label | semantics | accuracy | Other | PPC | 89.78 | 89.96 | 89.48 |
| single-label | semantics | accuracy | Other | SICK-E | 87.33 | 88.33 | 88.96 |
| regression | semantics | spearman | Other | SICK-R | 84.37 | 85.93 | 86.54 |
| multi-label | social media | weighted-f1 | Other | TwitterEMO | 70.51 | 70.70 | 70.60 |
| single-label | reviews | accuracy | Other | IMDB | 93.55 | 94.36 | 96.03 |
| multi-label | law | weighted-f1 | Other | EURLEX | 79.68 | 79.19 | 79.77 |

Table 1. Comparison of the mean scores from five fine-tuning runs on 25 discriminative tasks in Polish. The evaluation metrics vary across tasks — the metric used for each task is specified in the METRIC column.
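
The non-KLEJ fine-tuning setup described above maps directly onto the Hugging Face Trainer. The snippet below is a minimal sketch encoding those hyperparameters; the model id, the label count, and the two-example dataset are placeholders:

```python
# Sketch of the non-KLEJ fine-tuning setup: 10 epochs, batch size 32,
# peak learning rate 1e-5, polynomial decay with a 6% warmup phase.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "sdadas/polish-roberta-8k"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny placeholder dataset; in practice, load the task's own training split.
train_dataset = Dataset.from_dict(
    {"text": ["Świetny produkt, polecam!", "Słaba jakość, nie kupujcie."], "label": [1, 0]}
)
train_dataset = train_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=8192),
    batched=True,
)

args = TrainingArguments(
    output_dir="finetuned",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=1e-5,              # maximum learning rate
    lr_scheduler_type="polynomial",  # polynomial decay after warmup
    warmup_ratio=0.06,               # warmup over 6% of total iterations
)

Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer).train()
```

The averaged results in Table 1 would correspond to repeating such a run five times with different `seed` values in `TrainingArguments`.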

Among the tasks mentioned above, four involve classification of long texts, and on each of them the new model outperforms the encoders with short context windows. These tasks are: MIPD (classification of disinformation intent in Polish news articles; Modzelewski et al. 2024), banking-long (topical classification of finance and banking-related articles into 14 categories), IMDB (sentiment analysis of movie reviews, translated into Polish from the English IMDB dataset, keeping only reviews longer than 2,500 characters), and EURLEX (the Polish part of the EUR-Lex corpus, with multi-label classification into 21 thematic classes). The results for these long-text tasks are presented in Table 2, along with the average text length in each dataset.

| TASK TYPE | DOMAIN | METRIC | AVG. CHARS | TASK | herbert-large | polish-roberta-large-v2 | polish-roberta-8k |
|---|---|---|---|---|---|---|---|
| multi-label | news | weighted-f1 | 5461 | MIPD | 66.79 | 67.27 | 68.50 |
| single-label | finance | accuracy | 2741 | banking-long | 86.64 | 87.89 | 88.35 |
| single-label | reviews | accuracy | 3597 | IMDB | 93.55 | 94.36 | 96.03 |
| multi-label | law | weighted-f1 | 8255 | EURLEX | 79.68 | 79.19 | 79.77 |

Table 2. Long text classification results.
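
Taking advantage of the longer context at inference time only requires raising the tokenizer's max_length. Below is a minimal sketch, assuming a checkpoint already fine-tuned on one of the long-text tasks above (the model path is a placeholder):

```python
# Long-document classification: unlike 512-token encoders, inputs can be
# tokenized up to 8192 tokens, so most long documents are not truncated.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "path/to/finetuned-polish-roberta-8k"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

long_review = "..."  # a document of several thousand characters
inputs = tokenizer(long_review, truncation=True, max_length=8192, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```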

Evaluation - Reranking for RAG

A common use case for encoder-type models is their application as retrievers or rerankers in retrieval-augmented generation (RAG) systems. Based on this model, we trained a new reranker for the Polish language using the same training procedure as for polish-reranker-roberta-v2. Evaluation on the PIRB benchmark demonstrated a significant improvement in performance on datasets composed of longer texts. Table 3 presents examples of datasets from this benchmark where performance gains were observed.

| Task | Improvement using sdadas/mmlw-retrieval-roberta-large retriever | Improvement using BAAI/bge-multilingual-gemma2 retriever |
|---|---|---|
| eprawnik | 70.70 ⟶ 76.43 (+5.73) | 73.24 ⟶ 82.36 (+9.12) |
| abczdrowie | 53.50 ⟶ 58.20 (+4.70) | 55.12 ⟶ 61.37 (+6.25) |
| specprawnik | 43.84 ⟶ 46.06 (+2.22) | 52.00 ⟶ 55.74 (+3.74) |
| zapytajfizyka | 96.10 ⟶ 97.44 (+1.34) | 96.09 ⟶ 97.48 (+1.39) |
| arguana | 63.12 ⟶ 67.42 (+4.30) | 63.78 ⟶ 67.53 (+3.75) |
| quora | 66.61 ⟶ 72.26 (+5.65) | 66.30 ⟶ 74.55 (+8.25) |

Table 3. Comparison between two rerankers, based on polish-roberta-large-v2 and polish-roberta-8k, on selected long-context tasks from the PIRB benchmark. We report absolute improvements in NDCG@10.
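
As an illustration of this use case, the sketch below scores query-passage pairs with sentence-transformers' CrossEncoder, one common interface for encoder-based rerankers; the model path, query, and passages are placeholders, not the released reranker:

```python
# Second-stage reranking in a RAG pipeline: score each retrieved passage
# against the query and sort by relevance. Model path is a placeholder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("path/to/polish-reranker-8k", max_length=8192)

query = "Jakie są skutki uboczne ibuprofenu?"  # "What are the side effects of ibuprofen?"
# Passages returned by a first-stage retriever (placeholders).
candidates = [
    "Ibuprofen może powodować dolegliwości żołądkowe i nudności.",
    "Warszawa jest stolicą Polski.",
]

scores = reranker.predict([(query, passage) for passage in candidates])
ranked = [p for _, p in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
print(ranked[0])  # the passage judged most relevant to the query
```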

Acknowledgements

This project is financed by the European Funds, registered under the number FENG.01.01-IP.01-A028/23-00. It focuses on "Building innovative large language models and a service platform for serving multi-task models within the Bank." The outcomes of this project are the result of a collaboration between the AI Lab at the National Information Processing Institute (Ośrodek Przetwarzania Informacji Państwowy Instytut Badawczy) and the AI Team at PKO Bank Polski.

Funding amount from the European Funds: 9.2 million PLN

Model size: 443M parameters (F32, safetensors)