polish-roberta-8k

A Polish language model built on the RoBERTa architecture, supporting a context length of up to 8192 tokens. Encoder-type models can be fine-tuned to solve various text prediction tasks such as classification, regression, sequence tagging, or retrieval. In such tasks, they are usually faster and more parameter-efficient than LLMs. This model was initialized from the polish-roberta-large-v2 checkpoint, which originally supported a 512-token context, and was then adapted to handle longer texts. The model training process was as follows:

  • In the first stage, the positional embedding layer was extended from 512 to 8192 tokens. The model was then trained on a corpus of approximately 150 billion tokens for one epoch. All other model weights were frozen; only the positional embedding layer was trained. The goal of this stage was to adapt the new layer without making drastic changes to the rest of the model's weights (see the sketch after this list).
  • In the second stage, all model weights were trained for 4 epochs. To improve training efficiency, we added support for Flash Attention 2 and contamination-free packing. Documents were packed into sequences of exactly 8192 tokens.
  • Since the model's ability to solve both short and long prediction tasks changed over the course of training, the final model is a merge of three checkpoints from different training stages (epochs 2, 3, and 4), which ensures balanced performance across different types of tasks.
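
The first stage can be sketched in a few lines of code. The snippet below is a minimal illustration, assuming a standard Hugging Face RoBERTa checkpoint; the model id and the tiling-based initialization of the new layer are our assumptions, not the exact training code:

```python
# Minimal sketch of stage 1: extend RoBERTa's positional embeddings from 512
# to 8192 positions and train only the new layer. The checkpoint id and the
# tiling initialization are assumptions, not the authors' exact procedure.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("sdadas/polish-roberta-large-v2")  # assumed id
embeddings = model.roberta.embeddings

old_emb = embeddings.position_embeddings  # 514 rows: 512 positions + 2-row offset
new_size = 8192 + 2                       # RoBERTa reserves two extra positions
new_emb = torch.nn.Embedding(new_size, old_emb.embedding_dim, padding_idx=old_emb.padding_idx)

with torch.no_grad():
    # Tile the original weights across the new table; one possible
    # initialization, the actual scheme is not specified in the card.
    for start in range(0, new_size, old_emb.num_embeddings):
        end = min(start + old_emb.num_embeddings, new_size)
        new_emb.weight[start:end] = old_emb.weight[: end - start]

embeddings.position_embeddings = new_emb
embeddings.register_buffer("position_ids", torch.arange(new_size).unsqueeze(0), persistent=False)
model.config.max_position_embeddings = new_size

# Freeze everything except the extended positional embedding layer.
for name, param in model.named_parameters():
    param.requires_grad = "position_embeddings" in name
```

After this stage, all weights are unfrozen for the main training runs; the final merge of the epoch 2, 3, and 4 checkpoints can then be performed, for example, by element-wise averaging of their weights.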

Evaluation - Fine-Tuning

The main advantage of the new model lies in its ability to process longer texts than existing Polish encoders. For short classification tasks, it maintains performance comparable to polish-roberta-large-v2. We analyzed the capabilities of our model by fine-tuning it on 25 different tasks in Polish. For comparison, we also included the two most popular Polish encoder models: herbert-large-cased and the original polish-roberta-large-v2. For each task and model, we performed five separate training runs with different random seeds, and the reported results are the averaged values. For tasks from the KLEJ benchmark, we used the hyperparameters from the original fine-tuning scripts for polish-roberta-large-v2. For the remaining tasks, we applied the same hyperparameters to every task: 10 epochs, a batch size of 32, a learning rate scheduler with a warmup phase covering 6% of the total iterations, a maximum learning rate of 1e-5, and polynomial decay. Note that these results may differ slightly from those on the KLEJ leaderboard, which includes fine-tuning results obtained with different frameworks and hyperparameter sets. Table 1 presents a summary of our results.

| TASK TYPE | DOMAIN | METRIC | GROUP | TASK | herbert-large | polish-roberta-large-v2 | polish-roberta-8k |
|---|---|---|---|---|---|---|---|
| single-label | mixed | accuracy | KLEJ | NKJP-NER | 96.07 | 95.75 | 95.64 |
| single-label | semantics | accuracy | KLEJ | CDSC-E | 94.78 | 94.16 | 94.28 |
| regression | semantics | spearman | KLEJ | CDSC-R | 95.01 | 95.25 | 95.33 |
| single-label | social media | binary-f1 | KLEJ | CBD | 70.21 | 73.10 | 73.23 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-IN | 91.39 | 93.55 | 93.05 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-OUT | 81.66 | 83.81 | 83.64 |
| single-label | mixed | binary-f1 | KLEJ | DYK | 73.31 | 74.87 | 74.05 |
| single-label | news | binary-f1 | KLEJ | PSC | 98.85 | 98.37 | 98.56 |
| regression | reviews | 1-wmae | KLEJ | AR | 89.23 | 89.36 | 88.91 |
| single-label | finance | accuracy | FinBench | banking-short | 81.80 | 81.69 | 81.99 |
| single-label | finance | accuracy | FinBench | banking-long | 86.64 | 87.89 | 88.35 |
| single-label | finance | accuracy | FinBench | banking77 | 92.76 | 92.45 | 92.74 |
| regression | finance | r2-score | FinBench | fiqa | 61.20 | 65.71 | 68.43 |
| single-label | finance | accuracy | FinBench | fpb | 84.99 | 85.26 | 85.42 |
| multi-label | finance | weighted-f1 | FinBench | gcn | 95.25 | 95.04 | 94.97 |
| single-label | finance | accuracy | FinBench | stooq | 82.53 | 85.07 | 84.41 |
| single-label | social media | accuracy | Other | 8TAGS | 81.16 | 81.64 | 81.44 |
| single-label | social media | accuracy | Other | BAN-PL | 93.25 | 93.80 | 93.99 |
| multi-label | news | weighted-f1 | Other | MIPD | 66.79 | 67.27 | 68.50 |
| single-label | semantics | accuracy | Other | PPC | 89.78 | 89.96 | 89.48 |
| single-label | semantics | accuracy | Other | SICK-E | 87.33 | 88.33 | 88.96 |
| regression | semantics | spearman | Other | SICK-R | 84.37 | 85.93 | 86.54 |
| multi-label | social media | weighted-f1 | Other | TwitterEMO | 70.51 | 70.70 | 70.60 |
| single-label | reviews | accuracy | Other | IMDB | 93.55 | 94.36 | 96.03 |
| multi-label | law | weighted-f1 | Other | EURLEX | 79.68 | 79.19 | 79.77 |

Table 1. Comparison of the mean scores from five fine-tuning runs on 25 discriminative tasks in Polish. The evaluation metrics vary across tasks — the metric used for each task is specified in the METRIC column.
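
The non-KLEJ fine-tuning setup described above maps directly onto the Hugging Face Trainer. The snippet below is a minimal sketch encoding those hyperparameters; the model id, the label count, and the two-example dataset are placeholders:

```python
# Sketch of the non-KLEJ fine-tuning setup: 10 epochs, batch size 32,
# peak learning rate 1e-5, polynomial decay with a 6% warmup phase.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "sdadas/polish-roberta-8k"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny placeholder dataset; in practice, load the task's own training split.
train_dataset = Dataset.from_dict(
    {"text": ["Świetny produkt, polecam!", "Słaba jakość, nie kupujcie."], "label": [1, 0]}
)
train_dataset = train_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=8192),
    batched=True,
)

args = TrainingArguments(
    output_dir="finetuned",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=1e-5,              # maximum learning rate
    lr_scheduler_type="polynomial",  # polynomial decay after warmup
    warmup_ratio=0.06,               # warmup over 6% of total iterations
)

Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer).train()
```

The averaged results in Table 1 would correspond to repeating such a run five times with different `seed` values in `TrainingArguments`.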

Among the tasks mentioned above, four involve classification of long texts, and on each of them the new model outperforms the encoders with short context windows. These tasks are: MIPD (classification of disinformation intent in Polish news articles; Modzelewski et al. 2024), banking-long (topical classification of finance and banking-related articles into 14 categories), IMDB (sentiment analysis of movie reviews, translated into Polish from the English IMDB dataset, keeping only reviews longer than 2,500 characters), and EURLEX (the Polish part of the EUR-Lex corpus, with multi-label classification into 21 thematic classes). The results for these long-text tasks are presented in Table 2, along with the average text length in each dataset.

| TASK TYPE | DOMAIN | METRIC | AVG. CHARS | TASK | herbert-large | polish-roberta-large-v2 | polish-roberta-8k |
|---|---|---|---|---|---|---|---|
| multi-label | news | weighted-f1 | 5461 | MIPD | 66.79 | 67.27 | 68.50 |
| single-label | finance | accuracy | 2741 | banking-long | 86.64 | 87.89 | 88.35 |
| single-label | reviews | accuracy | 3597 | IMDB | 93.55 | 94.36 | 96.03 |
| multi-label | law | weighted-f1 | 8255 | EURLEX | 79.68 | 79.19 | 79.77 |

Table 2. Long text classification results.
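
Taking advantage of the longer context at inference time only requires raising the tokenizer's max_length. Below is a minimal sketch, assuming a checkpoint already fine-tuned on one of the long-text tasks above (the model path is a placeholder):

```python
# Long-document classification: unlike 512-token encoders, inputs can be
# tokenized up to 8192 tokens, so most long documents are not truncated.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "path/to/finetuned-polish-roberta-8k"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

long_review = "..."  # a document of several thousand characters
inputs = tokenizer(long_review, truncation=True, max_length=8192, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```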

Evaluation - Reranking for RAG

A common use case for encoder-type models is their application as retrievers or rerankers in retrieval-augmented generation (RAG) systems. Based on this model, we trained a new reranker for the Polish language using the same training procedure as for polish-reranker-roberta-v2. Evaluation on the PIRB benchmark demonstrated a significant improvement in performance on datasets composed of longer texts. Table 3 presents examples of datasets from this benchmark where performance gains were observed.

| Task | Improvement using sdadas/mmlw-retrieval-roberta-large retriever | Improvement using BAAI/bge-multilingual-gemma2 retriever |
|---|---|---|
| eprawnik | 70.70 ⟶ 76.43 (+5.73) | 73.24 ⟶ 82.36 (+9.12) |
| abczdrowie | 53.50 ⟶ 58.20 (+4.70) | 55.12 ⟶ 61.37 (+6.25) |
| specprawnik | 43.84 ⟶ 46.06 (+2.22) | 52.00 ⟶ 55.74 (+3.74) |
| zapytajfizyka | 96.10 ⟶ 97.44 (+1.34) | 96.09 ⟶ 97.48 (+1.39) |
| arguana | 63.12 ⟶ 67.42 (+4.30) | 63.78 ⟶ 67.53 (+3.75) |
| quora | 66.61 ⟶ 72.26 (+5.65) | 66.30 ⟶ 74.55 (+8.25) |

Table 3. Comparison between two rerankers, based on polish-roberta-large-v2 and polish-roberta-8k, on selected long-context tasks from the PIRB benchmark. We report absolute improvements in NDCG@10.
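
As an illustration of this use case, the sketch below scores query-passage pairs with sentence-transformers' CrossEncoder, one common interface for encoder-based rerankers; the model path, query, and passages are placeholders, not the released reranker:

```python
# Second-stage reranking in a RAG pipeline: score each retrieved passage
# against the query and sort by relevance. Model path is a placeholder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("path/to/polish-reranker-8k", max_length=8192)

query = "Jakie są skutki uboczne ibuprofenu?"  # "What are the side effects of ibuprofen?"
# Passages returned by a first-stage retriever (placeholders).
candidates = [
    "Ibuprofen może powodować dolegliwości żołądkowe i nudności.",
    "Warszawa jest stolicą Polski.",
]

scores = reranker.predict([(query, passage) for passage in candidates])
ranked = [p for _, p in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
print(ranked[0])  # the passage judged most relevant to the query
```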

Acknowledgements

This project is financed by the European Funds, registered under the number FENG.01.01-IP.01-A028/23-00. It focuses on "Building innovative large language models and a service platform for serving multi-task models within the Bank." The outcomes of this project are the result of a collaboration between the AI Lab at the National Information Processing Institute (Ośrodek Przetwarzania Informacji Państwowy Instytut Badawczy) and the AI Team at PKO Bank Polski.

Funding amount from the European Funds: 9.2 million PLN

Model size: 443M parameters (F32, safetensors)