Overview
This checkpoint continues the pre-training of answerdotai/ModernBERT-large on Scandinavian text, adding ~1.2 trillion masked-language-model (MLM) training tokens drawn from The Nordic Pile and SWEb while preserving the original 8 192-token context window.
The tokenizer is trained from scratch on a subset of 11 985 103 472 tokens.
Training is carried out in a single stage with 8 192 tokens per sample for the entire run.
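For illustration only, here is a minimal sketch (not the actual training pipeline) of how 8 192-token MLM samples could be prepared with the released tokenizer; the 30 % masking probability is an assumption borrowed from the upstream ModernBERT recipe and is not stated in this card.

```python
# Sketch: building MLM samples at the trained sequence length.
# Assumptions: masking rate of 0.3 (not documented here), toy input texts.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.3)

texts = ["Huvudstaden i Sverige är Stockholm.", "Oslo er hovedstaden i Norge."]
enc = tok(texts, truncation=True, max_length=8192)  # one sample = up to 8 192 tokens
batch = collator([{"input_ids": ids} for ids in enc["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```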
Data Sources
| Corpus | Size | Selected Languages | Highlights |
|---|---|---|---|
| The Nordic Pile | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, Books, Code, etc.), filtered and deduplicated for high quality |
| SWEb | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common Crawl snapshots with model-based HTML extraction; 1.2 B documents |
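For reference, a sketch of how a corpus of this kind can be streamed with the `datasets` library; the dataset id below is hypothetical, so substitute the actual Hub id or local path of the SWEb / Nordic Pile data you have access to.

```python
from datasets import load_dataset

# Hypothetical dataset id -- replace with the real Hub id or a local path.
sweb = load_dataset("AI-Sweden-Models/SWEb", split="train", streaming=True)
for i, doc in enumerate(sweb):
    print(doc.get("text", "")[:200])  # field names may differ per corpus
    if i == 2:
        break
```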
Training Setup
| Setting | Value |
|---|---|
| Parameters | 395 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 9.82 × 10¹¹ / 1.20 × 10¹² (≈ 82 %) |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC LUMI-G system |
See training details here
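As a quick consistency check (plain arithmetic, no model code), the batch configuration in the table fits together as follows:

```python
gpus, micro_batch, seq_len = 64, 3, 8192
global_batch = gpus * micro_batch          # 64 × 3 = 192 sequences
tokens_per_batch = global_batch * seq_len  # 192 × 8 192 = 1 572 864 tokens
print(global_batch, tokens_per_batch)
```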
Training Stats
```
[token=982585522155/1198510347252]:
  Train time/batch: 716208
  Train time/sample: 137511936
  Train time/batch_in_epoch: 716208
  Train time/sample_in_epoch: 137511936
  Train time/token: 982584117341
  Train time/token_in_epoch: 982584117341
  Train trainer/device_train_microbatch_size: 3
  Train loss/train/total: 0.8162
  Train throughput/batches_per_sec: 0.6466
  Train throughput/samples_per_sec: 124.1393
  Train throughput/device/batches_per_sec: 0.0101
  Train throughput/device/samples_per_sec: 1.9397
  Train throughput/tokens_per_sec: 887795.9110
  Train throughput/device/tokens_per_sec: 13871.8111
  Train time/train: 317.5722
  Train time/val: 0.0000
  Train time/total: 317.5722
  Train lr-StableAdamW/group0: 0.0000
  Train lr-StableAdamW/group1: 0.0000
```
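The logged figures are internally consistent; a small check using only the numbers above:

```python
batches, samples = 716_208, 137_511_936
device_tok_s, global_tok_s = 13_871.8111, 887_795.9110

assert batches * 192 == samples                      # 192 sequences per batch
assert abs(device_tok_s * 64 - global_tok_s) < 1.0   # 64 GPUs
print("throughput and sample counts line up")
```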
Intended Use
- Fill-mask inference, embedding extraction and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.); see the embedding sketch after this list.
- Drop-in replacement for BERT-style encoders (omit `token_type_ids`).
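A minimal sketch of embedding extraction; mean pooling over the last hidden state is one common choice, not something prescribed by this card. Note that no `token_type_ids` are passed.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "AI-Sweden-Models/ModernBERT-large"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

batch = tok(["Huvudstaden i Sverige är Stockholm."], return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state    # (1, seq_len, hidden_size)
mask = batch["attention_mask"].unsqueeze(-1)      # mask out padding positions
embedding = (hidden * mask).sum(1) / mask.sum(1)  # (1, hidden_size) sentence vector
print(embedding.shape)
```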
Fill-mask
```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-large')
unmasker("Huvudstaden i Sverige är [MASK].")
```
```python
[{'score': 0.5732529759407043,
  'token': 2961,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är Stockholm.'},
 {'score': 0.06222670152783394,
  'token': 4481,
  'token_str': ' Göteborg',
  'sequence': 'Huvudstaden i Sverige är Göteborg.'},
 {'score': 0.02539575845003128,
  'token': 5882,
  'token_str': ' Malmö',
  'sequence': 'Huvudstaden i Sverige är Malmö.'},
 {'score': 0.024683712050318718,
  'token': 19931,
  'token_str': ' Norrköping',
  'sequence': 'Huvudstaden i Sverige är Norrköping.'},
 {'score': 0.02418600209057331,
  'token': 28202,
  'token_str': ' Solna',
  'sequence': 'Huvudstaden i Sverige är Solna.'}]
```
Limitations & Biases
- Web corpora can contain noise, stereotypes and sensitive content despite filtering.
- RoPE extrapolation beyond 8 k tokens is untested and may degrade; see the truncation sketch below.
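A simple guard is to cap inputs at the trained context length when tokenizing:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")
long_text = " ".join(["Huvudstaden i Sverige är Stockholm."] * 2000)
enc = tok(long_text, truncation=True, max_length=8192, return_tensors="pt")
print(enc["input_ids"].shape)  # sequence length capped at 8 192 tokens
```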
Code to reproduce