Fill-Mask · Transformers · Safetensors · modernbert · masked-lm · long-context

Overview

This checkpoint continues the pre-training of answerdotai/ModernBERT-large on Scandinavian text, extending the model’s knowledge with ~1.2 trillion additional masked-language-model (MLM) tokens drawn from The Nordic Pile and SWEb while preserving the original 8k token context window.

The tokenizer was trained from scratch on an 11 985 103 472-token subset of the training data.
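A quick way to inspect the new tokenizer is to load it from this checkpoint; a minimal sketch (the printed vocabulary size and tokenization are illustrative, not documented values):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with this checkpoint (trained from scratch on Scandinavian text).
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")

print(tokenizer.vocab_size)                                        # size of the new vocabulary
print(tokenizer.tokenize("Huvudstaden i Sverige är Stockholm."))   # sample Swedish sentence
```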

Training was done in a single stage, with 8 192 tokens per sample for the whole run.
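To illustrate what one training sample looks like, the sketch below tokenizes text to the 8 192-token sample length and applies standard MLM masking with DataCollatorForLanguageModeling. The 30 % masking probability follows the original ModernBERT recipe and is an assumption for this continued run, not a documented setting of this card.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")

# Standard BERT-style MLM masking; the 30 % rate is assumed from the original ModernBERT recipe.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

encoded = tokenizer("Huvudstaden i Sverige är Stockholm.", truncation=True, max_length=8192)
batch = collator([encoded])        # adds masked `input_ids` and the corresponding MLM `labels`
print(batch["input_ids"].shape, batch["labels"].shape)
```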

Data Sources

| Corpus | Size | Selected Languages | Highlights |
|---|---|---|---|
| The Nordic Pile | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, Books, Code, etc.), filtered and deduplicated for high quality |
| SWEb | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common Crawl snapshots with model-based HTML extraction; 1.2 B documents |
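For a quick look at the web corpus, the sketch below streams a few SWEb documents from the Hugging Face Hub instead of downloading the full ~3.6 TB. The dataset id and field names are assumptions here; check the SWEb dataset card for the exact path, configurations, and schema.

```python
from datasets import load_dataset

# Stream a handful of SWEb documents; the dataset id is an assumption, see the SWEb card.
sweb = load_dataset("AI-Sweden-Models/SWEb", split="train", streaming=True)

for i, doc in enumerate(sweb):
    print(sorted(doc.keys()))   # inspect the document schema
    if i == 2:
        break
```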

Training Setup

| Setting | Value |
|---|---|
| Parameters | 395 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 9.82 × 10¹¹ / 1.20 × 10¹² (≈ 82 %) |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC LUMI-G system |
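The learning-rate schedule in the table (1 % linear warm-up followed by cosine decay from 2e-4) can be sketched in plain PyTorch. StableAdamW is not part of torch, so AdamW stands in below, and the step count and final learning rate are placeholders rather than values from the real run.

```python
import math
import torch

model = torch.nn.Linear(8, 8)                 # placeholder module
# AdamW stands in for Decoupled StableAdamW, which is not available in torch itself.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

total_steps = 100_000                         # placeholder, not the real run length
warmup_steps = int(0.01 * total_steps)        # 1 % warm-up

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                   # linear warm-up to the peak lr
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```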

See training details here

Training Stats

[token=982585522155/1198510347252]:
     Train time/batch: 716208
     Train time/sample: 137511936
     Train time/batch_in_epoch: 716208
     Train time/sample_in_epoch: 137511936
     Train time/token: 982584117341
     Train time/token_in_epoch: 982584117341
     Train trainer/device_train_microbatch_size: 3
     Train loss/train/total: 0.8162
     Train throughput/batches_per_sec: 0.6466
     Train throughput/samples_per_sec: 124.1393
     Train throughput/device/batches_per_sec: 0.0101
     Train throughput/device/samples_per_sec: 1.9397
     Train throughput/tokens_per_sec: 887795.9110
     Train throughput/device/tokens_per_sec: 13871.8111
     Train time/train: 317.5722
     Train time/val: 0.0000
     Train time/total: 317.5722
     Train lr-StableAdamW/group0: 0.0000
     Train lr-StableAdamW/group1: 0.0000
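As a quick sanity check on these numbers, multiplying the per-device throughput by the 64 GPUs from the table reproduces the global figures:

```python
# Consistency check of the logged throughput (64 GPUs in total).
print(13871.8111 * 64)   # ≈ 887 795.9 tokens/s  -> matches tokens_per_sec
print(1.9397 * 64)       # ≈ 124.14 samples/s    -> matches samples_per_sec
print(0.0101 * 64)       # ≈ 0.65 batches/s      -> matches batches_per_sec
```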

Intended Use

  • Fill-mask inference, embedding extraction, and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.); see the embedding sketch after this list.
  • Drop-in replacement for BERT-style encoders (omit token_type_ids).
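A minimal embedding-extraction sketch, assuming mean pooling over the last hidden state (one common choice, not something prescribed by this card); note that no token_type_ids are passed:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")
model = AutoModel.from_pretrained("AI-Sweden-Models/ModernBERT-large")

sentences = ["Huvudstaden i Sverige är Stockholm.", "Oslo ligger i Norge."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")      # no token_type_ids

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (batch, seq_len, hidden_dim)

mask = inputs["attention_mask"].unsqueeze(-1)          # zero out padding positions
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pooled sentence vectors
print(embeddings.shape)
```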

Fill-mask

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-large')
unmasker("Huvudstaden i Sverige är [MASK].")   # "The capital of Sweden is [MASK]."
```

```
[{'score': 0.5732529759407043,
  'token': 2961,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är  Stockholm.'},
 {'score': 0.06222670152783394,
  'token': 4481,
  'token_str': ' Göteborg',
  'sequence': 'Huvudstaden i Sverige är  Göteborg.'},
 {'score': 0.02539575845003128,
  'token': 5882,
  'token_str': ' Malmö',
  'sequence': 'Huvudstaden i Sverige är  Malmö.'},
 {'score': 0.024683712050318718,
  'token': 19931,
  'token_str': ' Norrköping',
  'sequence': 'Huvudstaden i Sverige är  Norrköping.'},
 {'score': 0.02418600209057331,
  'token': 28202,
  'token_str': ' Solna',
  'sequence': 'Huvudstaden i Sverige är  Solna.'}]
```

Limitations & Biases

  • Web corpora can contain noise, stereotypes, and sensitive content despite filtering.
  • RoPE extrapolation beyond 8 192 tokens is untested; quality may degrade on longer inputs.

Code to reproduce
