MeMo-BERT-03

MeMo-BERT-03 is a pre-trained language model for historical Danish and Norwegian literary texts (1870–1900).
It was introduced by Al-Laith et al. (2024) as one of the first pre-trained language models dedicated to historical Danish and Norwegian.

Model Description

  • Architecture: XLM-RoBERTa-base (12 layers, 768 hidden size, 12 attention heads, ~250k vocabulary)
  • Pre-training strategy: continued pre-training of DanskBERT on historical data (sketched below)
  • Training objective: Masked Language Modeling (MLM, 15% token masking)
  • Training data: MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
  • Hardware: 2 × A100 GPUs
  • Training time: ~32 hours
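
The continued pre-training setup above can be sketched with the Hugging Face Trainer. This is a minimal, illustrative sketch rather than the authors' training script: the DanskBERT checkpoint id (vesteinn/DanskBERT), the corpus file path, and the epoch/batch-size settings are assumptions.

```python
# Illustrative sketch of continued MLM pre-training on the MeMo Corpus.
# Assumptions: DanskBERT is available as "vesteinn/DanskBERT", and the corpus
# has been exported to one sentence per line; hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("vesteinn/DanskBERT")
model = AutoModelForMaskedLM.from_pretrained("vesteinn/DanskBERT")

# Plain-text corpus, one sentence per line (path is illustrative).
corpus = load_dataset("text", data_files={"train": "memo_corpus_sentences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# 15% dynamic token masking, matching the training objective listed above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="memo-bert-03-continued",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```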

Intended Use

  • Primary tasks (see the usage sketch at the end of this section):

    • Sentiment Analysis (positive, neutral, negative)
    • Word Sense Disambiguation (historical vs. modern senses of skæbne, "fate")
  • Intended users:

    • Researchers in Digital Humanities, Computational Linguistics, and Scandinavian Studies.
    • Historians of literature studying 19th-century Scandinavian novels.
  • Not intended for:

    • Contemporary Danish/Norwegian NLP tasks (performance may degrade).
    • High-stakes applications (e.g., legal, medical, political decision-making).
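
A minimal usage sketch for querying the released checkpoint as a masked language model. The example sentence is invented, and "<mask>" is assumed to be the tokenizer's mask token (RoBERTa-style convention).

```python
# Minimal usage sketch: load MeMo-BERT-03 from the Hub and fill in a masked
# token in a 19th-century-style Danish sentence. The sentence is invented;
# "<mask>" is assumed to be the mask token used by the underlying tokenizer.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="MiMe-MeMo/MeMo-BERT-03")
for pred in fill_mask("Hans <mask> var beseglet fra den dag, han forlod gaarden."):
    print(pred["token_str"], round(pred["score"], 3))
```

For the primary tasks (sentiment analysis and word sense disambiguation), the model is meant to be fine-tuned with a classification head, as in the evaluation sketch further below.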

Training Data

  • Corpus: MeMo Corpus v1.1 (Bjerring-Hansen et al. 2022)
  • Time period: 1870–1900
  • Size: 839 novels, 690 MB, 3.2M sentences, 52.7M words
  • Preprocessing: OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, annotated

Evaluation

Benchmarks

| Task | Dataset | Test F1 | Notes |
|---|---|---|---|
| Sentiment Analysis | MiMe-MeMo/Sentiment-v1 | 0.77 | 3-class (positive / negative / neutral) |
| Word Sense Disambiguation | MiMe-MeMo/WSD-Skaebne | 0.61 | 4-class (pre-modern, modern, figure of speech, ambiguous) |
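
A hedged sketch of how a score like the sentiment figure above could be reproduced, assuming the dataset is hosted under the id in the table with "text"/"label" columns and train/test splits; the column and split names, the hyperparameters, and the choice of macro-averaged F1 are assumptions and may differ from the paper's exact setup.

```python
# Hedged evaluation sketch: fine-tune MeMo-BERT-03 for 3-class sentiment and
# report macro F1 on the test split. The dataset id comes from the table above;
# the "text"/"label" columns, splits, and hyperparameters are assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

ds = load_dataset("MiMe-MeMo/Sentiment-v1")
tokenizer = AutoTokenizer.from_pretrained("MiMe-MeMo/MeMo-BERT-03")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

ds = ds.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "MiMe-MeMo/MeMo-BERT-03", num_labels=3
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="memo-sentiment",
        num_train_epochs=5,
        per_device_train_batch_size=16,
    ),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate(ds["test"]))
```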

Comparison

MeMo-BERT-03 outperforms the earlier MeMo-BERT-01 and MeMo-BERT-02 variants as well as the contemporary Danish baselines (DanskBERT, ScandiBERT, DanBERT, BotXO) on both tasks.

Limitations

  • Domain-specific: trained only on novels from 1870–1900.
  • May not generalize to other genres (newspapers, folk tales, poetry).
  • Evaluation datasets are relatively small.
  • OCR/normalization errors remain in some texts.

Ethical Considerations

  • All texts are in the public domain (the authors are deceased).
  • The datasets are released under CC BY 4.0.
  • Word sense annotations were created by literary scholars; no sensitive personal data is involved.

Citation

If you use this model, please cite:

@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    pages = "4811--4819",
    url = "https://aclanthology.org/2024.lrec-main.431/"
}