polish-roberta-8k
A Polish language model built on the RoBERTa architecture, supporting a context length of up to 8192 tokens. Encoder-type models can be fine-tuned to solve various text prediction tasks such as classification, regression, sequence tagging, or retrieval, and in such tasks they are usually faster and more parameter-efficient than LLMs. This model was initialized from the polish-roberta-large-v2 checkpoint, which originally supported a 512-token context, and was then adapted to handle longer texts. The training process was as follows (illustrative sketches of the three stages follow the list):
- In the first stage, the positional embedding layer was extended from 512 to 8192 tokens, and the model was trained on a corpus of approximately 150 billion tokens for one epoch. All weights were frozen except the new positional embedding layer, which was the only part trained. The goal of this stage was to adapt the new layer without making drastic changes to the rest of the model's weights.
- In the second stage, all model weights were trained for 4 epochs. To improve training efficiency, we added support for Flash Attention 2 and contamination-free packing. Documents were packed into sequences of exactly 8192 tokens.
- Since the model's ability to solve both short and long prediction tasks changed during training, the final model is a merge of three checkpoints from different training stages (epochs 2, 3, and 4), which ensured balanced performance across different types of tasks.
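A minimal sketch of the stage-one adaptation, assuming the Hugging Face transformers implementation of RoBERTa; the checkpoint identifier and the initialization scheme for the new positions are assumptions, since the exact recipe is not published:

```python
import torch
from transformers import AutoModelForMaskedLM

# Short-context starting checkpoint (identifier is an assumption).
model = AutoModelForMaskedLM.from_pretrained("sdadas/polish-roberta-large-v2")

old_emb = model.roberta.embeddings.position_embeddings
pad_idx = model.config.pad_token_id   # RoBERTa offsets positions by pad_token_id + 1
new_len = 8192 + pad_idx + 1          # 8194 slots for an 8192-token context

new_emb = torch.nn.Embedding(new_len, old_emb.embedding_dim, padding_idx=pad_idx)
with torch.no_grad():
    # Keep the original 512-token positions unchanged...
    new_emb.weight[: old_emb.num_embeddings] = old_emb.weight
    # ...and initialize the new slots by cyclically tiling the old ones
    # (one of several reasonable schemes; the actual initialization is unknown).
    n_old = old_emb.num_embeddings - pad_idx - 1
    for i in range(old_emb.num_embeddings, new_len):
        new_emb.weight[i] = old_emb.weight[pad_idx + 1 + (i - pad_idx - 1) % n_old]

model.roberta.embeddings.position_embeddings = new_emb
model.config.max_position_embeddings = new_len
# Some transformers versions cache position ids in a buffer; extend it to match.
model.roberta.embeddings.register_buffer(
    "position_ids", torch.arange(new_len).expand((1, -1)), persistent=False
)

# Stage one trains only the positional embeddings: freeze everything else.
for param in model.parameters():
    param.requires_grad = False
new_emb.weight.requires_grad = True
```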
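"Contamination-free" packing fills each 8192-token training sequence with whole documents while recording their boundaries, so that attention (via Flash Attention 2's variable-length interface) never crosses a document boundary. Below is a simplified greedy packer; whether leftover space is padded or filled by splitting documents is not stated, so the padding here is an assumption:

```python
PAD_ID = 1      # RoBERTa's <pad> id; in practice, read it from the tokenizer
SEQ_LEN = 8192

def pack_documents(docs):
    """Greedily pack tokenized documents into SEQ_LEN-token sequences.

    Yields (token_ids, cu_seqlens): cu_seqlens holds the cumulative document
    boundaries inside each pack, which variable-length attention kernels
    (e.g. Flash Attention 2) use to stop attention from crossing documents.
    """
    def flush(tokens, cu_seqlens):
        if len(tokens) < SEQ_LEN:                    # pad the unused tail
            tokens = tokens + [PAD_ID] * (SEQ_LEN - len(tokens))
            cu_seqlens = cu_seqlens + [SEQ_LEN]      # padding is its own segment
        return tokens, cu_seqlens

    tokens, cu_seqlens = [], [0]
    for doc in docs:
        doc = doc[:SEQ_LEN]                          # clip over-long documents
        if len(tokens) + len(doc) > SEQ_LEN:
            yield flush(tokens, cu_seqlens)
            tokens, cu_seqlens = [], [0]
        tokens = tokens + doc
        cu_seqlens.append(len(tokens))
    if tokens:
        yield flush(tokens, cu_seqlens)
```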
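The merge method for the final checkpoint is not specified; uniform weight averaging of the three checkpoints is a common choice and is what the sketch below assumes (paths are placeholders):

```python
import torch
from transformers import AutoModelForMaskedLM

# Placeholder paths to the epoch-2, epoch-3, and epoch-4 checkpoints.
paths = ["ckpt-epoch2", "ckpt-epoch3", "ckpt-epoch4"]
models = [AutoModelForMaskedLM.from_pretrained(p) for p in paths]

merged = models[0]
with torch.no_grad():
    for key, value in merged.state_dict().items():
        if value.is_floating_point():  # skip integer buffers such as position ids
            value.copy_(torch.stack([m.state_dict()[key] for m in models]).mean(dim=0))
merged.save_pretrained("polish-roberta-8k-merged")
```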
Evaluation - Fine-Tuning
The main advantage of the new model lies in its ability to process longer texts than existing Polish encoders. On short classification tasks, it maintains performance comparable to polish-roberta-large-v2. We analyzed the capabilities of our model by fine-tuning it on 25 different tasks in Polish. For comparison, we also included the two most popular Polish encoder models: herbert-large-cased and the original polish-roberta-large-v2. For each task and model, we performed five separate training runs with different random seeds and report the averaged results. For tasks from the KLEJ benchmark, we used the hyperparameters from the original fine-tuning scripts for polish-roberta-large-v2. For the remaining tasks, we applied the same hyperparameters throughout: 10 epochs, a batch size of 32, a learning rate scheduler with a warmup phase of 6% of the total iterations, a maximum learning rate of 1e-5, and polynomial decay. Note that these results may differ slightly from those on the KLEJ leaderboard, which includes fine-tuning results obtained with different frameworks and hyperparameter sets. Table 1 presents a summary of our results.
| TASK TYPE | DOMAIN | METRIC | GROUP | TASK | herbert-large | polish-roberta-large-v2 | polish-roberta-8k |
|---|---|---|---|---|---|---|---|
| single-label | mixed | accuracy | KLEJ | NKJP-NER | 96.07 | 95.75 | 95.64 |
| single-label | semantics | accuracy | KLEJ | CDSC-E | 94.78 | 94.16 | 94.28 |
| regression | semantics | spearman | KLEJ | CDSC-R | 95.01 | 95.25 | 95.33 |
| single-label | social media | binary-f1 | KLEJ | CBD | 70.21 | 73.10 | 73.23 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-IN | 91.39 | 93.55 | 93.05 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-OUT | 81.66 | 83.81 | 83.64 |
| single-label | mixed | binary-f1 | KLEJ | DYK | 73.31 | 74.87 | 74.05 |
| single-label | news | binary-f1 | KLEJ | PSC | 98.85 | 98.37 | 98.56 |
| regression | reviews | 1-wmae | KLEJ | AR | 89.23 | 89.36 | 88.91 |
| single-label | finance | accuracy | FinBench | banking-short | 81.80 | 81.69 | 81.99 |
| single-label | finance | accuracy | FinBench | banking-long | 86.64 | 87.89 | 88.35 |
| single-label | finance | accuracy | FinBench | banking77 | 92.76 | 92.45 | 92.74 |
| regression | finance | r2-score | FinBench | fiqa | 61.20 | 65.71 | 68.43 |
| single-label | finance | accuracy | FinBench | fpb | 84.99 | 85.26 | 85.42 |
| multi-label | finance | weighted-f1 | FinBench | gcn | 95.25 | 95.04 | 94.97 |
| single-label | finance | accuracy | FinBench | stooq | 82.53 | 85.07 | 84.41 |
| single-label | social media | accuracy | Other | 8TAGS | 81.16 | 81.64 | 81.44 |
| single-label | social media | accuracy | Other | BAN-PL | 93.25 | 93.80 | 93.99 |
| multi-label | news | weighted-f1 | Other | MIPD | 66.79 | 67.27 | 68.50 |
| single-label | semantics | accuracy | Other | PPC | 89.78 | 89.96 | 89.48 |
| single-label | semantics | accuracy | Other | SICK-E | 87.33 | 88.33 | 88.96 |
| regression | semantics | spearman | Other | SICK-R | 84.37 | 85.93 | 86.54 |
| multi-label | social media | weighted-f1 | Other | TwitterEMO | 70.51 | 70.70 | 70.60 |
| single-label | reviews | accuracy | Other | IMDB | 93.55 | 94.36 | 96.03 |
| multi-label | law | weighted-f1 | Other | EURLEX | 79.68 | 79.19 | 79.77 |
Table 1. Comparison of mean scores from five fine-tuning runs on 25 discriminative tasks in Polish. Evaluation metrics vary across tasks; the metric used for each task is given in the METRIC column.
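For the non-KLEJ tasks, the hyperparameters listed above map directly onto a standard transformers Trainer configuration. A sketch, under the assumption that the original runs used an equivalent setup:

```python
from transformers import TrainingArguments

# Hyperparameters used for the non-KLEJ tasks, expressed as Trainer arguments.
args = TrainingArguments(
    output_dir="finetune-out",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=1e-5,              # maximum learning rate
    warmup_ratio=0.06,               # warmup over 6% of the total iterations
    lr_scheduler_type="polynomial",  # polynomial decay after warmup
    seed=42,                         # the five runs used different random seeds
)
```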
Among the tasks above, four involve classification of long texts, and on each of them the new model outperforms the encoders with short context windows. These tasks are: MIPD (classification of disinformation intent in Polish news articles; Modzelewski et al. 2024), banking-long (topical classification of finance and banking articles into 14 categories), IMDB (sentiment analysis of movie reviews, translated into Polish from the English IMDB dataset and keeping only reviews longer than 2,500 characters), and EURLEX (the Polish part of the EUR-Lex corpus, with multi-label classification into 21 thematic classes). Results for these long-text tasks are presented in Table 2, along with the average text length in each dataset.
| TASK TYPE | DOMAIN | METRIC | AVG. CHARS | TASK | herbert-large | polish-roberta-large-v2 | polish-roberta-8k |
|---|---|---|---|---|---|---|---|
| multi-label | news | weighted-f1 | 5461 | MIPD | 66.79 | 67.27 | 68.50 |
| single-label | finance | accuracy | 2741 | banking-long | 86.64 | 87.89 | 88.35 |
| single-label | reviews | accuracy | 3597 | IMDB | 93.55 | 94.36 | 96.03 |
| multi-label | law | weighted-f1 | 8255 | EURLEX | 79.68 | 79.19 | 79.77 |
Table 2. Long-text classification results; AVG. CHARS is the average document length in characters.
Evaluation - Reranking for RAG
A common use case for encoder-type models is their application as retrievers or rerankers in retrieval-augmented generation (RAG) systems. Based on this model, we trained a new reranker for the Polish language using the same training procedure as for polish-reranker-roberta-v2. Evaluation on the PIRB benchmark demonstrated a significant improvement in performance on datasets composed of longer texts. Table 3 presents examples of datasets from this benchmark where performance gains were observed.
| Task | Improvement using the sdadas/mmlw-retrieval-roberta-large retriever | Improvement using the BAAI/bge-multilingual-gemma2 retriever |
|---|---|---|
| eprawnik | 70.70 ⟶ 76.43 (+5.73) | 73.24 ⟶ 82.36 (+9.12) |
| abczdrowie | 53.50 ⟶ 58.20 (+4.70) | 55.12 ⟶ 61.37 (+6.25) |
| specprawnik | 43.84 ⟶ 46.06 (+2.22) | 52.00 ⟶ 55.74 (+3.74) |
| zapytajfizyka | 96.10 ⟶ 97.44 (+1.34) | 96.09 ⟶ 97.48 (+1.39) |
| arguana | 63.12 ⟶ 67.42 (+4.30) | 63.78 ⟶ 67.53 (+3.75) |
| quora | 66.61 ⟶ 72.26 (+5.65) | 66.30 ⟶ 74.55 (+8.25) |
Table 3. Comparison between two rerankers, based on polish-roberta-large-v2 and polish-roberta-8k respectively, on selected long-context tasks from the PIRB benchmark. We report the absolute improvement in NDCG@10.
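Assuming the new reranker exposes the same sentence-transformers CrossEncoder interface as polish-reranker-roberta-v2, reranking retrieved passages might look like the sketch below (the model path is a placeholder):

```python
from sentence_transformers import CrossEncoder

# Placeholder identifier: substitute the actual published reranker checkpoint.
reranker = CrossEncoder("path/to/polish-reranker-roberta-8k", max_length=8192)

query = "Jakie są skutki uboczne ibuprofenu?"
passages = [
    "Ibuprofen może powodować dolegliwości żołądkowe...",
    "Warszawa jest stolicą Polski.",
]

# Score each (query, passage) pair; higher scores mean higher relevance.
scores = reranker.predict([(query, p) for p in passages])
# Reorder retrieved passages before passing them to the LLM generation step.
ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
```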
Acknowledgements
This project is financed by the European Funds, registered under the number FENG.01.01-IP.01-A028/23-00. It focuses on "Building innovative large language models and a service platform for serving multi-task models within the Bank." The outcomes of this project are the result of a collaboration between the AI Lab at the National Information Processing Institute (Ośrodek Przetwarzania Informacji Państwowy Instytut Badawczy) and the AI Team at PKO Bank Polski.
Funding amount from the European Funds: PLN 9.2 million