
language:
  - da
  - 'no'
license: cc-by-4.0
datasets:
  - MiMe-MeMo/Corpus-v1.1
  - MiMe-MeMo/Sentiment-v1
  - MiMe-MeMo/WSD-Skaebne
metrics:
  - f1
tags:
  - historical-texts
  - digital-humanities
  - sentiment-analysis
  - word-sense-disambiguation
  - danish
  - norwegian
model-index:
  - name: MeMo-BERT-03
    results:
      - task:
          type: text-classification
          name: Sentiment Analysis
        dataset:
          name: MiMe-MeMo/Sentiment-v1
          type: text
        metrics:
          - name: f1
            type: f1
            value: 0.77
      - task:
          type: text-classification
          name: Word Sense Disambiguation
        dataset:
          name: MiMe-MeMo/WSD-Skaebne
          type: text
        metrics:
          - name: f1
            type: f1
            value: 0.61

MeMo-BERT-03

MeMo-BERT-03 is a pre-trained language model (PLM) for historical Danish and Norwegian literary texts from 1870–1900.
It was introduced in Al-Laith et al. (2024) as one of the first PLMs dedicated to historical Danish and Norwegian.

Model Description

Architecture: XLM-RoBERTa-base (12 layers, 768 hidden size, 12 heads, vocab size 250k)
Pre-training strategy: Continued pre-training of DanskBERT on historical data
Training objective: Masked Language Modeling (MLM, 15% masking)
Training data: MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
Hardware: 2 × A100 GPUs
Training time: ~32 hours
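
To illustrate the masked language modeling objective listed above, here is a minimal sketch of selecting 15% of the positions in a sequence and hiding them. It is a simplification and not the training code used for MeMo-BERT-03: real MLM pipelines work on subword ids and usually mix in random or unchanged tokens at the selected positions. The `mask_tokens` helper, the `<mask>` string, and the placeholder tokens are assumptions made for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask string, assumed here for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide roughly `mask_prob` of the tokens.

    Returns (masked, labels): `labels` holds the original token at each
    masked position and None everywhere else.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Always mask at least one position so short inputs still give a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Placeholder tokens, not text from the MeMo corpus.
tokens = [f"tok{i}" for i in range(20)]
masked, labels = mask_tokens(tokens)
```

With 20 tokens and a 15% rate, three positions are masked; the model is trained to recover the originals stored in `labels`.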

Intended Use

Primary tasks:
- Sentiment Analysis (positive, neutral, negative)
- Word Sense Disambiguation (historical vs. modern senses of skæbne, "fate")
Intended users:
- Researchers in Digital Humanities, Computational Linguistics, and Scandinavian Studies.
- Historians of literature studying 19th-century Scandinavian novels.
Not intended for:
- Contemporary Danish/Norwegian NLP tasks (performance may degrade).
- High-stakes applications (e.g., legal, medical, political decision-making).
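
For the sentiment task listed above, one plausible way to prepare the model for fine-tuning with the Hugging Face `transformers` library is sketched below. The Hub identifier `MiMe-MeMo/MeMo-BERT-03`, the label order, and the helper name `load_sentiment_classifier` are assumptions; check the model page for the exact repository name and label scheme.

```python
# Assumed Hub identifier; verify the exact name on the model page.
MODEL_ID = "MiMe-MeMo/MeMo-BERT-03"

# Assumed label order; the card names the three classes but not their ids.
LABELS = ["negative", "neutral", "positive"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: i for i, label in ID2LABEL.items()}


def load_sentiment_classifier(model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a new 3-way classification head."""
    # Imported lazily so the label maps above work without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

Because the checkpoint is trained with a masked language modeling objective, the classification head created this way starts from random weights and has to be fine-tuned on labeled data before its predictions are meaningful.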

Training Data

Corpus: MeMo Corpus v1.1 (Bjerring-Hansen et al. 2022)
Time period: 1870–1900
Size: 839 novels, 690 MB, 3.2M sentences, 52.7M words
Preprocessing: OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, annotated

Evaluation

Benchmarks

Task	Dataset	Test F1	Notes
Sentiment Analysis	MiMe-MeMo/Sentiment-v1	0.77	3-class (pos/neg/neu)
Word Sense Disambiguation	MiMe-MeMo/WSD-Skaebne	0.61	4-class (pre-modern, modern, figure of speech, ambiguous)
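
The table does not say how F1 is averaged across classes. As a worked example of one common choice, the sketch below computes macro-averaged F1 (the unweighted mean of per-class F1) in plain Python. The averaging mode is an assumption, and the toy labels are invented for illustration; they are not drawn from the evaluation datasets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 3-class example (invented labels).
y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neu", "neg", "neg", "pos"]
score = macro_f1(y_true, y_pred)
```

Here the per-class scores are 0.5 (pos), 0.8 (neg), and about 0.667 (neu), giving a macro F1 of about 0.656. Macro averaging weights every class equally, which matters when classes such as "ambiguous" are rare.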

Comparison

MeMo-BERT-03 outperforms MeMo-BERT-01, MeMo-BERT-02, and contemporary baselines (DanskBERT, ScandiBERT, DanBERT, BotXO) on both tasks.

Limitations

Domain-specific: trained only on novels from 1870–1900.
May not generalize to other genres (newspapers, folk tales, poetry).
Evaluation datasets are relatively small.
OCR/normalization errors remain in some texts.

Ethical Considerations

All texts are in the public domain (the authors are deceased).
Datasets are released under CC BY 4.0.
Word sense annotations were created by literary scholars and contain no sensitive personal data.

Citation

If you use this model, please cite:

@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    pages = "4811--4819",
    url = "https://aclanthology.org/2024.lrec-main.431/"
}