MeMo-BERT-01

MeMo-BERT-01 is a pre-trained language model for historical Danish and Norwegian literary texts (1870–1900).
It was introduced in Al-Laith et al. (2024) as one of the first pre-trained language models (PLMs) dedicated to historical Danish and Norwegian.

Model Description

  • Architecture: BERT-base (12 layers, hidden size 768, 12 attention heads, vocab size 30k)
  • Pre-training strategy: Trained from scratch on the MeMo corpus (no prior pre-training)
  • Training objective: Masked Language Modeling (MLM, 15% masking)
  • Training data: MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
  • Hardware: 2 × A100 GPUs
  • Training time: ~44 hours

It is the baseline historical-domain model, trained entirely on 19th-century Scandinavian novels.
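
Because MeMo-BERT-01 is a standard BERT-base MLM checkpoint, it can be queried directly with the Hugging Face transformers library. The snippet below is a minimal usage sketch, assuming the checkpoint is hosted under the repository ID MiMe-MeMo/MeMo-BERT-01 together with its tokenizer and that the mask token is the usual BERT [MASK]; the example sentence is illustrative and not drawn from the corpus.

```python
from transformers import pipeline

# Minimal sketch: load MeMo-BERT-01 for masked-token prediction.
# Assumes the checkpoint lives at "MiMe-MeMo/MeMo-BERT-01" and ships a tokenizer.
fill_mask = pipeline("fill-mask", model="MiMe-MeMo/MeMo-BERT-01")

# Illustrative sentence in modern Danish orthography (matching the normalized
# training corpus); it is not taken from the MeMo corpus.
for pred in fill_mask("Hun troede ikke på sin [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```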

Intended Use

  • Primary tasks (a fine-tuning sketch follows this list):

    • Sentiment Analysis (positive, neutral, negative)
    • Word Sense Disambiguation (historical vs. modern senses of skæbne, "fate")
  • Intended users:

    • Researchers in Digital Humanities, Computational Linguistics, and Scandinavian Studies.
    • Historians of literature studying 19th-century Scandinavian novels.
  • Not intended for:

    • Contemporary Danish/Norwegian NLP tasks.
    • High-stakes applications (e.g., legal, medical, political decision-making).
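
For the primary tasks above, the model is meant to be fine-tuned with a task-specific classification head. The following is a minimal fine-tuning sketch for the 3-class sentiment task, not the exact setup from the paper; the dataset ID MiMe-MeMo/Sentiment-v1 is taken from the evaluation table below, while the text/label column names, split names, and hyperparameters are assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "MiMe-MeMo/MeMo-BERT-01"  # this model card's checkpoint

# Dataset ID from the evaluation table below; the "text"/"label" column names
# and the train/test split names are assumptions about that dataset's schema.
dataset = load_dataset("MiMe-MeMo/Sentiment-v1")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

# Three labels: positive / neutral / negative (see "Primary tasks" above).
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="memo-bert-01-sentiment",   # placeholder output directory
        num_train_epochs=3,                    # assumed, not from the paper
        per_device_train_batch_size=16,        # assumed, not from the paper
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```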

Training Data

  • Corpus: MeMo Corpus v1.1 (Bjerring-Hansen et al. 2022)
  • Time period: 1870–1900
  • Size: 839 novels, 690 MB, 3.2M sentences, 52.7M words
  • Preprocessing: OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, annotated
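
The from-scratch MLM setup described under "Model Description" can be outlined with the same library. The sketch below covers only the model configuration and the 15% dynamic-masking collator; tokenizer training on the normalized corpus, data loading, and the full training loop are omitted, and the tokenizer path is a placeholder.

```python
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

# Architecture values are taken from "Model Description" above; everything else
# is a library default or placeholder, not a value reported in the paper.
config = BertConfig(vocab_size=30_000, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12)
model = BertForMaskedLM(config)  # randomly initialized, i.e. trained from scratch

# A WordPiece tokenizer trained on the normalized MeMo corpus is assumed to
# exist at this placeholder path.
tokenizer = BertTokenizerFast.from_pretrained("path/to/memo-tokenizer")

# Dynamic masking with the 15% MLM probability stated above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
```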

Evaluation

Benchmarks

| Task | Dataset | Test F1 | Notes |
|------|---------|---------|-------|
| Sentiment Analysis | MiMe-MeMo/Sentiment-v1 | 0.56 | 3-class (positive / negative / neutral) |
| Word Sense Disambiguation | MiMe-MeMo/WSD-Skaebne | 0.43 | 4-class (pre-modern, modern, figure of speech, ambiguous) |
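
Continuing the fine-tuning sketch from "Intended Use", a test-set F1 in the spirit of the table above can be computed as follows; the averaging method ("macro" here) is an assumption, since the table does not state which F1 average is reported.

```python
import numpy as np
from sklearn.metrics import f1_score

# `trainer` and `dataset` come from the fine-tuning sketch under "Intended Use".
predictions = trainer.predict(dataset["test"])
y_pred = np.argmax(predictions.predictions, axis=-1)
y_true = predictions.label_ids

# Macro averaging is assumed; the benchmark table does not specify it.
print("test F1:", f1_score(y_true, y_pred, average="macro"))
```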

Comparison

MeMo-BERT-01 performs worse than MeMo-BERT-03 (continued pre-training), highlighting the limitations of training from scratch on historical data without leveraging contemporary PLMs.

Limitations

  • Trained from scratch on only ~53M words, a relatively small corpus for BERT pre-training.
  • Underperforms compared to continued pre-training (MeMo-BERT-03).
  • Domain-specific to late 19th-century novels.
  • Residual OCR and spelling-normalization errors may remain in the training corpus.

Ethical Considerations

  • All texts are in the public domain (their authors are deceased).
  • Datasets released under CC BY 4.0.
  • No sensitive personal data involved.

Citation

If you use this model, please cite:

@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    pages = "4811--4819",
    url = "https://aclanthology.org/2024.lrec-main.431/"
}