GeistBERT

GeistBERT is a German language model trained on a largely deduplicated corpus comprising OSCAR23, OPUS, and MC4, among other sources. It builds on GottBERT and introduces Whole Word Masking (WWM) to improve contextual language representation. The model achieves state-of-the-art (SOTA) performance on multiple German NLP benchmarks.

GeistBERT comes in three versions:

  • GeistBERT (RoBERTa base architecture)
  • GeistBERT-Nyströmformer (efficient long-input attention)
  • GeistBERT-Longformer (efficient long-input attention)
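
The snippet below is a minimal usage sketch with the Hugging Face transformers library, assuming the base checkpoint is available under the GeistBERT/GeistBERT_base model id used on this page; since the model is pretrained with (whole word) masked language modeling, the fill-mask pipeline is the most direct way to query it.

```python
# Minimal sketch: querying GeistBERT's masked-language-modeling head.
# Assumes the checkpoint id "GeistBERT/GeistBERT_base" and a RoBERTa-style
# <mask> token; adjust if your local copy differs.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GeistBERT/GeistBERT_base")

for prediction in fill_mask("Berlin ist die <mask> von Deutschland."):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.3f}")
```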

Training Data

GeistBERT was trained on a diverse German corpus combining:

  • OSCAR23, OPUS, and MC4 (largely deduplicated)
  • German Wikipedia
  • OpenLegalData
  • Europarl, EUbookshop, ECB, and EuroPat
  • OpenSubtitles and TildeMODEL

The dataset amounts to approximately 1.3T tokens, shuffled for improved variance.

Training Procedure

Hardware

  • Training was conducted on multiple GPUs, including NVIDIA RTX 3090 (24GB VRAM).
  • Gradient accumulation was used for the Longformer variant, which requires more VRAM than the Nyströmformer and RoBERTa variants; the latter two fit on a single RTX 3090.

Hyperparameters

| Parameter | Value |
|---|---|
| Model Architecture | RoBERTa (base) |
| Batch Size | 8,000 |
| Training Steps | 100k |
| Weight Initialization | GottBERT filtered base |
| Warmup Iterations | 10k |
| Peak Learning Rate | 0.0007 |
| Learning Rate Decay | Polynomial to zero |
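
The schedule in the table (10k warmup iterations up to a peak learning rate of 0.0007, then polynomial decay to zero over 100k steps) can be sketched with the scheduler helper from transformers. This is purely illustrative, assuming a linear (power = 1.0) polynomial decay; it is not the original fairseq training configuration.

```python
# Illustrative sketch of the schedule above: linear warmup over 10k steps
# to a peak LR of 7e-4, then polynomial decay to zero by step 100k.
# The optimizer and parameter are placeholders just to drive the scheduler.
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

dummy_param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([dummy_param], lr=7e-4)

scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,
    num_training_steps=100_000,
    lr_end=0.0,   # decay all the way to zero
    power=1.0,    # assumed linear flavour of polynomial decay
)

for step in range(100_000):
    optimizer.step()
    scheduler.step()
    if step + 1 in (10_000, 50_000, 100_000):
        print(f"step {step + 1:>7}: lr = {scheduler.get_last_lr()[0]:.6f}")
```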

Performance

GeistBERT achieves SOTA results on multiple tasks:

  • NER: CoNLL 2003, GermEval 2014
  • Text Classification: GermEval 2018 (coarse & fine), 10kGNAD
  • NLI: German subset of XNLI

Metrics:

  • NER and Text Classification: F1 Score
  • NLI: Accuracy

Details:

  • Bold values indicate the best-performing model within an architecture class (base, large); underlined values indicate the second best.
| Model | NLI (Accuracy) | GermEval 2014 (F1) | CoNLL 2003 (F1) | GermEval 2018 Coarse (F1) | GermEval 2018 Fine (F1) | 10kGNAD (F1) |
|---|---|---|---|---|---|---|
| GeistBERT | 82.67 | 88.47 | 86.17 | 79.67 | 66.42 | 90.89 |
| GeistBERT-Nyströmformer | 82.50 | 88.23 | 85.76 | 79.17 | 78.57 | 90.33 |
| GeistBERT-Longformer | 82.51 | 88.45 | 86.71 | 80.56 | 66.76 | 90.32 |
| GottBERT_base_best | 80.82 | 87.55 | 85.93 | 78.17 | 53.30 | 89.64 |
| GottBERT_base_last | 81.04 | 87.48 | 85.61 | 78.18 | 53.92 | 90.27 |
| GottBERT_filtered_base_best | 80.56 | 87.57 | 86.14 | 78.65 | 52.82 | 89.79 |
| GottBERT_filtered_base_last | 80.74 | 87.59 | 85.66 | 78.08 | 52.39 | 89.92 |
| GELECTRA_base | 81.70 | 86.91 | 85.37 | 77.26 | 50.07 | 89.02 |
| GBERT_base | 80.06 | 87.24 | 85.16 | 77.37 | 51.51 | 90.30 |
| dbmdzBERT | 68.12 | 86.82 | 85.15 | 77.46 | 52.07 | 90.34 |
| GermanBERT | 78.16 | 86.53 | 83.87 | 74.81 | 47.78 | 90.18 |
| XLM-R_base | 79.76 | 86.14 | 84.46 | 77.13 | 50.54 | 89.81 |
| mBERT | 77.03 | 86.67 | 83.18 | 73.54 | 48.32 | 88.90 |
| GottBERT_large | 82.46 | 88.20 | 86.78 | 79.40 | 54.61 | 90.24 |
| GottBERT_filtered_large_best | 83.31 | 88.13 | 86.30 | 79.32 | 54.70 | 90.31 |
| GottBERT_filtered_large_last | 82.79 | 88.27 | 86.28 | 78.96 | 54.72 | 90.17 |
| GELECTRA_large | 86.33 | 88.72 | 86.78 | 81.28 | 56.17 | 90.97 |
| GBERT_large | 84.21 | 88.72 | 87.19 | 80.84 | 57.37 | 90.74 |
| XLM-R_large | 84.07 | 88.83 | 86.54 | 79.05 | 55.06 | 90.17 |
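
For reference, the sketch below shows how the base checkpoint can be loaded with a sequence-classification head for a task like the binary GermEval 2018 coarse setting; the example texts, labels, and training step are placeholders, not the evaluation setup behind the numbers above.

```python
# Hedged sketch: fine-tuning GeistBERT for binary text classification
# (e.g. GermEval 2018 coarse). Data handling and hyperparameters here are
# illustrative placeholders, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "GeistBERT/GeistBERT_base"  # checkpoint id assumed from this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

texts = ["Das ist ein harmloser Beispielsatz.",
         "Noch ein Beispielsatz für die Klassifikation."]
labels = torch.tensor([0, 1])  # placeholder labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()        # one illustrative training step (optimizer omitted)
print(outputs.logits.shape)    # (batch_size, num_labels)
```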

Intended Use

This model is designed for German NLP tasks, including:

  • Text classification
  • Named entity recognition (NER)
  • Machine translation pre-training
  • Document understanding
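
For the NER use case, a token-classification head can be attached in the same way; the tag set below is an illustrative CoNLL-style subset, not a label set shipped with the model.

```python
# Sketch: GeistBERT with a token-classification head for NER.
# The label list is an illustrative CoNLL-style subset, assumed for this example.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = AutoTokenizer.from_pretrained("GeistBERT/GeistBERT_base")
model = AutoModelForTokenClassification.from_pretrained(
    "GeistBERT/GeistBERT_base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

tokens = tokenizer("Angela Merkel besuchte Freiburg.", return_tensors="pt")
logits = model(**tokens).logits   # (1, seq_len, num_labels)
print(logits.argmax(-1))          # untrained head: predictions are random until fine-tuned
```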

Limitations

  • Trained on unfiltered data, meaning some redundant or lower-quality samples may be present.
  • Longformer requires more VRAM, making it less accessible for smaller GPU setups.
  • While deduplication was applied to specific subcorpora, the full corpus was not manually curated.

Fairseq Checkpoints

Get the fairseq checkpoints here.
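
Once downloaded, the checkpoints can be loaded with fairseq's RoBERTa wrapper as sketched below; the directory and file names are placeholders, and depending on how the checkpoint is packaged you may also need to point from_pretrained at the matching dictionary/BPE files.

```python
# Sketch: loading a GeistBERT fairseq checkpoint with fairseq's RoBERTa API.
# "path/to/geistbert_checkpoint" and "model.pt" are placeholder names.
from fairseq.models.roberta import RobertaModel

geistbert = RobertaModel.from_pretrained(
    "path/to/geistbert_checkpoint",   # directory with the checkpoint and dict.txt
    checkpoint_file="model.pt",
)
geistbert.eval()

tokens = geistbert.encode("Berlin ist die Hauptstadt von Deutschland.")
features = geistbert.extract_features(tokens)   # (1, seq_len, hidden_size)
print(features.shape)
```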

Citations

If you use GeistBERT in your research, please cite the following paper:

@misc{scheibleschmitt2025geistbertbreathinglifegerman,
      title={GeistBERT: Breathing Life into German NLP}, 
      author={Raphael Scheible-Schmitt and Johann Frei},
      year={2025},
      eprint={2506.11903},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.11903}, 
}