
Finnish ModernBERT Model Card
Finnish ModernBERT tiny is an encoder model following the ModernBERT architecture, pretrained on Finnish, Swedish, English, code, Latin, and Northern Sámi. It was trained on 448B tokens on the LUMI supercomputer. The project aimed to train multilingual encoder models that support long contexts and all official languages of Finland¹. The model can theoretically extrapolate to a context length of 128,000 tokens.
¹Multiple Sámi languages are spoken in Finland; Northern Sámi is the most widespread and was therefore included in the training data. English is not an official language of Finland, but it is widely used. Latin was included for potential clinical use.
Table of Contents
- Model Overview
- Training
- Training data
- Evaluation results
- Ethical Considerations and Limitations
- Acknowledgements
- Licence
- Citation information
Model Overview
Hyperparameter | Value |
---|---|
n_parameters | 51M |
n_layers | 6 |
RoPE theta | 10,000 / 1,000,000 |
vocab_size | 27,264 |
sequence_length | 16,000 / 128,000 |
Training
Pretraining used Distributed Data Parallel training, AdamW with ZeroRedundancyOptimizer, and a warmup-stable-decay (WSD) learning rate schedule. The model was trained with a learning rate of 8e-4, a sequence length of 1024, and a RoPE theta of 10,000 for 377B tokens over 117,300 steps.
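The WSD schedule can be sketched as a piecewise function. The warmup length and the linear decay shape below are assumptions for illustration; the stable and decay step counts mirror the figures reported in this card (117,300 pretraining steps, 4,139 annealing steps):

```python
def wsd_lr(step, peak_lr=8e-4, warmup_steps=2000,
           stable_steps=117_300, decay_steps=4_139):
    """Warmup-Stable-Decay (WSD) learning rate schedule sketch.

    Linear warmup to peak_lr, a long constant phase, then a linear
    decay to zero during annealing. warmup_steps and the decay shape
    are illustrative assumptions, not the card's exact settings.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    decay_progress = (step - warmup_steps - stable_steps) / decay_steps
    return peak_lr * max(0.0, 1.0 - decay_progress)
```

The stable phase is what allows the annealing run to branch off from any stable-phase checkpoint with a fresh decay, as done here.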
Long context training
The model was then trained with a learning rate of 5e-4, increasing the context length from 1024 to 16,000 in six stages, with each sequence length trained for an equal number of tokens, for a total of 53B tokens over 16,560 steps. The RoPE theta of the global attention layers was increased to 1,000,000. Long documents were sampled from the original data according to the distribution below:
Sequence length | % |
---|---|
< 1,000 | 21 |
1,000–10,000 | 78 |
10,000–16,000 | 1 |
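The reason raising RoPE theta helps at longer contexts is that each rotary frequency pair has a wavelength measured in tokens, and raising theta stretches every wavelength. A minimal sketch, assuming a head dimension of 64 for illustration (the model's actual head dimension is not stated here):

```python
import math

def rope_wavelengths(theta, head_dim=64):
    """Wavelength (in tokens) of each RoPE frequency pair.

    Frequency pair i rotates at theta**(-2*i/head_dim), so its
    wavelength is 2*pi*theta**(2*i/head_dim). A larger theta
    stretches all wavelengths, letting relative positions stay
    distinguishable over longer spans. head_dim=64 is an
    illustrative assumption.
    """
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

short = rope_wavelengths(10_000)     # pretraining theta
long_ = rope_wavelengths(1_000_000)  # long-context theta (global layers)
```

Under this sketch, every wavelength for theta = 1,000,000 is at least as long as the corresponding one for theta = 10,000, which is consistent with the card's claim of extrapolation beyond the trained 16,000-token context.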
Annealing
For the learning rate decay phase, the dataset was swapped to a high-quality subset. The RoPE theta and context length were kept the same as in long context training. The model was annealed for 18B tokens over 4,139 steps.
Training data
All pretraining data (excluding the annealing data) were globally exact-deduplicated and had PII removed.
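Global exact deduplication can be illustrated with a hash-based sketch. This is not the project's actual pipeline (which is not specified here); it only shows the idea of keeping the first copy of each byte-identical document across the whole corpus:

```python
import hashlib

def exact_dedup(docs):
    """Global exact deduplication sketch.

    Keeps the first occurrence of each document, keyed by a SHA-256
    hash of its exact text, so only byte-identical duplicates are
    removed. Any text normalization used in the real pipeline is
    not modeled here.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing keeps memory proportional to the number of unique documents rather than their total size, which matters at the multi-hundred-billion-token scale described above.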
Pretraining data
Data by language
Language | Tokens | % |
---|---|---|
Code | 14.12B | 3.6 |
English | 80.77B | 20.7 |
Finnish | 209.09B | 53.6 |
Latin | 0.94B | 0.3 |
Northern Sámi | 1.07B | 0.3 |
Swedish | 80.09B | 20.5 |
Cross-lingual | 3.98B | 1.0 |
Total | 390B | 100 |
Individual datasets
Language | Dataset | Notes | Sampling fraction | Tokens |
---|---|---|---|---|
Code | Starcoder | GitHub issues | 0.83 | 12.8B |
Code | SmolLM | PythonEdu (score 5) | 30 | 1.4B |
English | British Library | - | 1 | 1.9B |
English | Europarl | English subset | 5 | 0.06B |
English | FineWeb-Edu fortified | - | 0.5 | 69.5B |
English | Natural Instructions | - | 1 | 0.7B |
English | peS2o | - | 0.13 | 51.9B |
English | PubMed Central | - | 0.1 | 22.1B |
English | PubMed Abstracts | - | 1 | 3.8B |
English | Wikipedia | Dump 20241101 | 9 | 3.8B |
Finnish | CC-fi | FinGPT | 4 | 10.8B |
Finnish | CulturaX | Finnish subset | 3.7 | 16.9B |
Finnish | HPLT 2.0 | Finnish subset | 3.7 | 19.1B |
Finnish | nlfcl-fi | Finnish subset | 6 | 0.02B |
Finnish | Europarl | Finnish subset | 6 | 0.12B |
Finnish | Lönnrot | FinGPT | 6 | 0.13B |
Finnish | Reddit-Fi | FinGPT | 6 | 0.11B |
Finnish | Suomi24 | FinGPT | 6 | 3.27B |
Finnish | Wikipedia | Dump 20241101 | 30 | 0.13B |
Finnish | Yle | FinGPT | 30 | 0.22B |
Finnish | Ylilauta | - | 30 | 0.22B |
Latin | CulturaX | Latin subset | 30 | 0.03B |
Northern Sámi | Glot500 | Northern Sámi subset | 30 | 0.004B |
Northern Sámi | saami-web | - | 30 | 0.017B |
Northern Sámi | SALT | - | 30 | 0.015B |
Swedish | CulturaX | Swedish subset | 1.09 | 28.7B |
Swedish | Europarl | Swedish subset | 5 | 0.05B |
Swedish | fstc | - | 5 | 0.002B |
Swedish | HPLT 2.0 | Swedish subset | 1.05 | 35.8B |
Swedish | nlfcl-sv | Swedish subset | 5 | 0.014B |
Swedish | Wikipedia | Dump 20241101 | 30 | 0.27B |
Swedish | Yle | Swedish subset | 30 | 0.27B |
Cross-lingual | Tatoeba | English-Finnish | 0.62 | 1.07B |
Cross-lingual | OPUS | English-Northern Sámi | 30 | 5K |
Cross-lingual | Tatoeba | English-Swedish | 0.57 | 1.15B |
Cross-lingual | Tatoeba | Finnish-English | 0.62 | 1.06B |
Cross-lingual | OPUS | Finnish-Northern Sámi | 30 | 12K |
Cross-lingual | Tatoeba | Finnish-Swedish | 5.7 | 0.12B |
Cross-lingual | OPUS | Northern Sámi-English | 30 | 5K |
Cross-lingual | OPUS | Northern Sámi-Finnish | 30 | 12K |
Cross-lingual | OPUS | Northern Sámi-Swedish | 30 | 0.8K |
Cross-lingual | Tatoeba | Swedish-English | 0.58 | 1.15B |
Cross-lingual | Tatoeba | Swedish-Finnish | 5.7 | 0.12B |
Cross-lingual | OPUS | Swedish-Northern Sámi | 30 | 0.8K |
Annealing data
Details coming soon.
Evaluation results
A complete set of evaluations is coming soon. A limited set of assessments using a modified version of EuroEval is presented in the tables below. For each model, five learning rates were tested against the validation set, and the F1 score was used to select the optimal one. Results are means over 10 iterations on bootstrapped versions of the training and test sets.
The results indicate that Finnish ModernBERT is competitive with other multilingual models at short context lengths and performs best on tasks that do not involve token-level predictions.
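The bootstrapped scoring protocol described above can be sketched as follows. The `score_fn` argument and the resampling details are stand-ins; EuroEval's own implementation differs in detail:

```python
import random
import statistics

def bootstrap_scores(score_fn, test_set, iterations=10, seed=0):
    """Mean and standard deviation of a metric over bootstrapped
    test sets, mirroring the 10-iteration protocol above.

    Each iteration resamples the test set with replacement to its
    original size and scores it. score_fn is a stand-in for any
    metric (F1, MCC, exact match, ...).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        sample = [rng.choice(test_set) for _ in test_set]
        scores.append(score_fn(sample))
    return statistics.mean(scores), statistics.stdev(scores)
```

The ± values in the tables below are spreads of this kind, so overlapping intervals between two models should be read as statistically indistinguishable.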
Finnish
Model | scala-fi | scandisent-fi | turku-ner-fi | tydiqa-fi | Params (M) |
---|---|---|---|---|---|
FacebookAI/xlm-roberta-large | mcc: 50.84±3.76 / macro_f1: 74.32±2.41 | mcc: 90.39±1.12 / macro_f1: 95.18±0.56 | micro_f1_no_misc: 84.31±1.35 / micro_f1: 81.93±1.07 | f1: 56.66±5.70 / em: 35.34±4.34 | 561.2 |
TurkuNLP/bert-base-finnish-cased-v1 | mcc: 47.16±5.27 / macro_f1: 72.98±2.47 | mcc: 90.16±0.50 / macro_f1: 95.08±0.25 | micro_f1_no_misc: 82.04±1.33 / micro_f1: 79.35±0.94 | f1: 56.20±1.42 / em: 35.68±1.82 | 125.2 |
TurkuNLP/bert-large-finnish-cased-v1 | mcc: 58.81±2.46 / macro_f1: 78.91±1.23 | mcc: 91.69±0.60 / macro_f1: 95.85±0.30 | micro_f1_no_misc: 77.57±1.43 / micro_f1: 74.50±1.74 | f1: 59.91±1.19 / em: 39.10±1.18 | 355.2 |
TurkuNLP/finnish-modernbert-base | mcc: 24.81±6.66 / macro_f1: 61.46±3.62 | mcc: 84.59±1.80 / macro_f1: 92.26±0.89 | micro_f1_no_misc: 56.17±4.80 / micro_f1: 56.03±4.91 | f1: 30.04±1.27 / em: 14.22±1.25 | 143.4 |
TurkuNLP/finnish-modernbert-large | mcc: 51.88±3.07 / macro_f1: 75.39±1.91 | mcc: 88.02±2.33 / macro_f1: 93.99±1.18 | micro_f1_no_misc: 71.11±1.83 / micro_f1: 70.47±1.44 | f1: 43.45±2.92 / em: 23.47±2.90 | 401.3 |
TurkuNLP/finnish-modernbert-large-seq-len-1024-117300-annealed | mcc: 49.81±4.13 / macro_f1: 74.58±2.10 | mcc: 88.50±2.88 / macro_f1: 94.22±1.47 | micro_f1_no_misc: 71.16±2.41 / micro_f1: 70.58±2.01 | f1: 42.40±3.43 / em: 22.17±2.78 | 401.3 |
TurkuNLP/finnish-modernbert-tiny | mcc: 4.94±1.95 / macro_f1: 51.89±1.24 | mcc: 76.15±1.93 / macro_f1: 88.05±0.97 | micro_f1_no_misc: 52.45±1.23 / micro_f1: 53.81±1.05 | f1: 29.63±0.42 / em: 14.59±0.58 | 51.6 |
intfloat/multilingual-e5-large | mcc: 12.06±4.33 / macro_f1: 54.51±3.19 | mcc: 90.77±0.70 / macro_f1: 95.37±0.36 | micro_f1_no_misc: 80.55±1.28 / micro_f1: 78.08±1.14 | f1: 60.87±1.77 / em: 39.98±1.78 | 559.9 |
Swedish
Model | scala-sv | scandiqa-sv | suc3 | swerec | Params (M) |
---|---|---|---|---|---|
AI-Sweden-Models/roberta-large-1160k | mcc: 76.24±1.30 / macro_f1: 87.74±0.72 | f1: 53.13±0.86 / em: 46.76±1.08 | micro_f1_no_misc: 79.27±2.28 / micro_f1: 76.65±2.03 | mcc: 77.43±0.65 / macro_f1: 76.11±1.73 | 355.4 |
FacebookAI/xlm-roberta-large | mcc: 72.61±2.84 / macro_f1: 85.79±1.42 | f1: 47.91±1.23 / em: 41.40±1.00 | micro_f1_no_misc: 79.12±1.13 / micro_f1: 76.69±1.14 | mcc: 75.34±0.60 / macro_f1: 70.16±2.52 | 561.2 |
TurkuNLP/finnish-modernbert-base | mcc: 58.79±2.50 / macro_f1: 78.96±1.22 | f1: 29.98±2.03 / em: 23.35±2.22 | micro_f1_no_misc: 51.67±3.10 / micro_f1: 53.42±3.09 | mcc: 63.10±3.20 / macro_f1: 62.47±4.03 | 143.4 |
TurkuNLP/finnish-modernbert-large | mcc: 69.42±3.72 / macro_f1: 84.50±2.01 | f1: 34.26±0.85 / em: 27.46±0.86 | micro_f1_no_misc: 59.99±2.42 / micro_f1: 60.27±2.05 | mcc: 71.01±2.11 / macro_f1: 71.36±1.14 | 401.3 |
TurkuNLP/finnish-modernbert-large-seq-len-1024-117300-annealed | mcc: 66.97±2.66 / macro_f1: 83.38±1.36 | f1: 38.83±2.12 / em: 32.53±2.09 | micro_f1_no_misc: 59.65±1.64 / micro_f1: 59.91±1.33 | mcc: 70.18±3.77 / macro_f1: 69.85±4.05 | 401.3 |
TurkuNLP/finnish-modernbert-tiny | mcc: 11.31±3.88 / macro_f1: 54.81±2.30 | f1: 27.19±0.82 / em: 19.54±0.97 | micro_f1_no_misc: 48.06±2.18 / micro_f1: 49.55±1.87 | mcc: 63.73±1.75 / macro_f1: 63.98±1.64 | 51.6 |
intfloat/multilingual-e5-large | mcc: 49.79±11.17 / macro_f1: 73.39±6.85 | f1: 52.23±0.90 / em: 44.44±1.34 | micro_f1_no_misc: 77.37±1.84 / micro_f1: 75.75±1.76 | mcc: 79.13±1.03 / macro_f1: 77.44±2.85 | 559.9 |
English
Model | conll-en | scala-en | squad | sst5 | Params (M) |
---|---|---|---|---|---|
FacebookAI/xlm-roberta-large | micro_f1_no_misc: 88.74±1.06 / micro_f1: 88.12±0.94 | mcc: 34.33±15.56 / macro_f1: 64.04±9.79 | f1: 70.42±0.84 / em: 57.34±0.82 | mcc: 58.86±1.33 / macro_f1: 58.07±2.23 | 561.2 |
TurkuNLP/finnish-modernbert-base | micro_f1_no_misc: 70.64±2.52 / micro_f1: 72.96±1.99 | mcc: 14.04±3.08 / macro_f1: 56.21±1.86 | f1: 29.36±6.50 / em: 18.20±5.63 | mcc: 33.81±3.80 / macro_f1: 46.50±2.77 | 143.4 |
TurkuNLP/finnish-modernbert-large | micro_f1_no_misc: 79.73±1.29 / micro_f1: 80.90±1.11 | mcc: 50.98±3.90 / macro_f1: 74.94±2.06 | f1: 55.98±2.65 / em: 40.35±2.57 | mcc: 37.08±5.53 / macro_f1: 49.38±4.69 | 401.3 |
TurkuNLP/finnish-modernbert-large-seq-len-1024-117300-annealed | micro_f1_no_misc: 79.15±0.60 / micro_f1: 80.20±0.47 | mcc: 46.82±5.34 / macro_f1: 72.62±2.64 | f1: 58.70±1.98 / em: 42.86±1.95 | mcc: 38.60±3.48 / macro_f1: 51.67±3.58 | 401.3 |
TurkuNLP/finnish-modernbert-tiny | micro_f1_no_misc: 68.71±1.09 / micro_f1: 71.02±0.89 | mcc: 4.72±2.12 / macro_f1: 51.47±1.40 | f1: 12.00±0.47 / em: 4.96±0.43 | mcc: 21.24±4.35 / macro_f1: 40.46±2.94 | 51.6 |
intfloat/multilingual-e5-large | micro_f1_no_misc: 90.83±0.49 / micro_f1: 90.08±0.41 | mcc: 37.27±8.82 / macro_f1: 68.10±4.43 | f1: 72.19±0.85 / em: 58.64±0.76 | mcc: 65.11±0.97 / macro_f1: 64.68±2.38 | 559.9 |
microsoft/deberta-v3-base | micro_f1_no_misc: 91.05±0.53 / micro_f1: 90.46±0.54 | mcc: 64.68±1.29 / macro_f1: 81.85±0.67 | f1: 75.68±0.86 / em: 62.80±0.98 | mcc: 62.03±1.05 / macro_f1: 60.52±3.55 | 183.8 |
Ethical Considerations and Limitations
Finnish ModernBERT may produce representations that reflect biases and patterns present in its training data. To support a wide range of use cases, the training data were not filtered for toxic, harmful, or offensive content.
Acknowledgements
We thank CSC, the IT Center for Science in Finland, for the computational resources. We thank the Language Bank of Finland for additional resources for Finnish, Finland-Swedish, and Swedish. This research was also supported by the HPLT project and the Finnish Cultural Foundation.
Licence
Finnish ModernBERT tiny is released under the Apache 2.0 license.
Citation information
Preprint coming soon. If you need to cite this work, please use the citation below:
@misc{finnish_modernbert_2025,
  author = {Reunamo, Akseli and Pyysalo, Sampo},
  title = {Finnish-ModernBERT: A Family of ModernBERTs for Finnish languages},
  year = 2025,
  url = {https://huggingface.co/collections/TurkuNLP/finnish-modernberts-685bb5f2ab4d39d6480a16d4},
  publisher = {Hugging Face}
}