Fill-Mask · Transformers · Safetensors · modernbert · masked-lm · long-context

Overview

This checkpoint continues the pre-training of answerdotai/ModernBERT-large on Scandinavian text, extending the model’s knowledge with ~1.2 trillion additional masked-language-model (MLM) tokens drawn from The Nordic Pile and SWEb while preserving the original 8k token context window.

The tokenizer was trained from scratch on an 11 985 103 472-token subset of the training data.
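A quick way to inspect the new tokenizer is to load it from this checkpoint; a minimal sketch (the printed vocabulary size and tokenization are illustrative, not documented values):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with this checkpoint (trained from scratch on Scandinavian text).
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")

print(tokenizer.vocab_size)                                        # size of the new vocabulary
print(tokenizer.tokenize("Huvudstaden i Sverige är Stockholm."))   # sample Swedish sentence
```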

Training was done in a single stage, with 8 192 tokens per sample for the whole run.
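To illustrate what one training sample looks like, the sketch below tokenizes text to the 8 192-token sample length and applies standard MLM masking with DataCollatorForLanguageModeling. The 30 % masking probability follows the original ModernBERT recipe and is an assumption for this continued run, not a documented setting of this card.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")

# Standard BERT-style MLM masking; the 30 % rate is assumed from the original ModernBERT recipe.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

encoded = tokenizer("Huvudstaden i Sverige är Stockholm.", truncation=True, max_length=8192)
batch = collator([encoded])        # adds masked `input_ids` and the corresponding MLM `labels`
print(batch["input_ids"].shape, batch["labels"].shape)
```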

Data Sources

| Corpus | Size | Selected Languages | Highlights |
|---|---|---|---|
| The Nordic Pile | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, Books, Code, etc.), filtered and deduplicated for high quality |
| SWEb | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common Crawl snapshots with model-based HTML extraction; 1.2 B documents |
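For a quick look at the web corpus, the sketch below streams a few SWEb documents from the Hugging Face Hub instead of downloading the full ~3.6 TB. The dataset id and field names are assumptions here; check the SWEb dataset card for the exact path, configurations, and schema.

```python
from datasets import load_dataset

# Stream a handful of SWEb documents; the dataset id is an assumption, see the SWEb card.
sweb = load_dataset("AI-Sweden-Models/SWEb", split="train", streaming=True)

for i, doc in enumerate(sweb):
    print(sorted(doc.keys()))   # inspect the document schema
    if i == 2:
        break
```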

Training Setup

| Setting | Value |
|---|---|
| Parameters | 395 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 9.82 × 10¹¹ / 1.20 × 10¹² (≈ 82 %) |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC LUMI-G system |
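The learning-rate schedule in the table (1 % linear warm-up followed by cosine decay from 2e-4) can be sketched in plain PyTorch. StableAdamW is not part of torch, so AdamW stands in below, and the step count and final learning rate are placeholders rather than values from the real run.

```python
import math
import torch

model = torch.nn.Linear(8, 8)                 # placeholder module
# AdamW stands in for Decoupled StableAdamW, which is not available in torch itself.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

total_steps = 100_000                         # placeholder, not the real run length
warmup_steps = int(0.01 * total_steps)        # 1 % warm-up

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                   # linear warm-up to the peak lr
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```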

See training details here

Training Stats

[token=982585522155/1198510347252]:
     Train time/batch: 716208
     Train time/sample: 137511936
     Train time/batch_in_epoch: 716208
     Train time/sample_in_epoch: 137511936
     Train time/token: 982584117341
     Train time/token_in_epoch: 982584117341
     Train trainer/device_train_microbatch_size: 3
     Train loss/train/total: 0.8162
     Train throughput/batches_per_sec: 0.6466
     Train throughput/samples_per_sec: 124.1393
     Train throughput/device/batches_per_sec: 0.0101
     Train throughput/device/samples_per_sec: 1.9397
     Train throughput/tokens_per_sec: 887795.9110
     Train throughput/device/tokens_per_sec: 13871.8111
     Train time/train: 317.5722
     Train time/val: 0.0000
     Train time/total: 317.5722
     Train lr-StableAdamW/group0: 0.0000
     Train lr-StableAdamW/group1: 0.0000
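As a quick sanity check on these numbers, multiplying the per-device throughput by the 64 GPUs from the table reproduces the global figures:

```python
# Consistency check of the logged throughput (64 GPUs in total).
print(13871.8111 * 64)   # ≈ 887 795.9 tokens/s  -> matches tokens_per_sec
print(1.9397 * 64)       # ≈ 124.14 samples/s    -> matches samples_per_sec
print(0.0101 * 64)       # ≈ 0.65 batches/s      -> matches batches_per_sec
```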

Intended Use

  • Fill-mask inference, embedding extraction, and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.); see the embedding sketch after this list.
  • Drop-in replacement for BERT-style encoders (omit token_type_ids).
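A minimal embedding-extraction sketch, assuming mean pooling over the last hidden state (one common choice, not something prescribed by this card); note that no token_type_ids are passed:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-large")
model = AutoModel.from_pretrained("AI-Sweden-Models/ModernBERT-large")

sentences = ["Huvudstaden i Sverige är Stockholm.", "Oslo ligger i Norge."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")      # no token_type_ids

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (batch, seq_len, hidden_dim)

mask = inputs["attention_mask"].unsqueeze(-1)          # zero out padding positions
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pooled sentence vectors
print(embeddings.shape)
```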

Fill-mask

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-large')
unmasker("Huvudstaden i Sverige är [MASK].")   # "The capital of Sweden is [MASK]."
```

```
[{'score': 0.5732529759407043,
  'token': 2961,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är  Stockholm.'},
 {'score': 0.06222670152783394,
  'token': 4481,
  'token_str': ' Göteborg',
  'sequence': 'Huvudstaden i Sverige är  Göteborg.'},
 {'score': 0.02539575845003128,
  'token': 5882,
  'token_str': ' Malmö',
  'sequence': 'Huvudstaden i Sverige är  Malmö.'},
 {'score': 0.024683712050318718,
  'token': 19931,
  'token_str': ' Norrköping',
  'sequence': 'Huvudstaden i Sverige är  Norrköping.'},
 {'score': 0.02418600209057331,
  'token': 28202,
  'token_str': ' Solna',
  'sequence': 'Huvudstaden i Sverige är  Solna.'}]
```

Limitations & Biases

  • Web corpora can contain noise, stereotypes, and sensitive content despite filtering.
  • RoPE extrapolation beyond 8 192 tokens is untested; quality may degrade on longer inputs.

Code to reproduce
