MeMo-BERT-01

MeMo-BERT-01 is a pre-trained language model for historical Danish and Norwegian literary texts (1870–1900).
It was introduced in Al-Laith et al. (2024) as one of the first pre-trained language models (PLMs) dedicated to historical Danish and Norwegian.

Model Description

  • Architecture: BERT-base (12 layers, hidden size 768, 12 attention heads, vocab size 30k)
  • Pre-training strategy: Trained from scratch on the MeMo corpus (no prior pre-training)
  • Training objective: Masked Language Modeling (MLM, 15% masking)
  • Training data: MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
  • Hardware: 2 × A100 GPUs
  • Training time: ~44 hours

It is the baseline historical-domain model, trained entirely on 19th-century Scandinavian novels.
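
Because MeMo-BERT-01 is a standard BERT-base MLM checkpoint, it can be queried directly with the Hugging Face transformers library. The snippet below is a minimal usage sketch, assuming the checkpoint is hosted under the repository ID MiMe-MeMo/MeMo-BERT-01 together with its tokenizer and that the mask token is the usual BERT [MASK]; the example sentence is illustrative and not drawn from the corpus.

```python
from transformers import pipeline

# Minimal sketch: load MeMo-BERT-01 for masked-token prediction.
# Assumes the checkpoint lives at "MiMe-MeMo/MeMo-BERT-01" and ships a tokenizer.
fill_mask = pipeline("fill-mask", model="MiMe-MeMo/MeMo-BERT-01")

# Illustrative sentence in modern Danish orthography (matching the normalized
# training corpus); it is not taken from the MeMo corpus.
for pred in fill_mask("Hun troede ikke på sin [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```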

Intended Use

  • Primary tasks (a fine-tuning sketch follows this list):

    • Sentiment Analysis (positive, neutral, negative)
    • Word Sense Disambiguation (historical vs. modern senses of skæbne, "fate")
  • Intended users:

    • Researchers in Digital Humanities, Computational Linguistics, and Scandinavian Studies.
    • Historians of literature studying 19th-century Scandinavian novels.
  • Not intended for:

    • Contemporary Danish/Norwegian NLP tasks.
    • High-stakes applications (e.g., legal, medical, political decision-making).
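
For the primary tasks above, the model is meant to be fine-tuned with a task-specific classification head. The following is a minimal fine-tuning sketch for the 3-class sentiment task, not the exact setup from the paper; the dataset ID MiMe-MeMo/Sentiment-v1 is taken from the evaluation table below, while the text/label column names, split names, and hyperparameters are assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "MiMe-MeMo/MeMo-BERT-01"  # this model card's checkpoint

# Dataset ID from the evaluation table below; the "text"/"label" column names
# and the train/test split names are assumptions about that dataset's schema.
dataset = load_dataset("MiMe-MeMo/Sentiment-v1")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

# Three labels: positive / neutral / negative (see "Primary tasks" above).
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="memo-bert-01-sentiment",   # placeholder output directory
        num_train_epochs=3,                    # assumed, not from the paper
        per_device_train_batch_size=16,        # assumed, not from the paper
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```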

Training Data

  • Corpus: MeMo Corpus v1.1 (Bjerring-Hansen et al. 2022)
  • Time period: 1870–1900
  • Size: 839 novels, 690 MB, 3.2M sentences, 52.7M words
  • Preprocessing: OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, annotated
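
The from-scratch MLM setup described under "Model Description" can be outlined with the same library. The sketch below covers only the model configuration and the 15% dynamic-masking collator; tokenizer training on the normalized corpus, data loading, and the full training loop are omitted, and the tokenizer path is a placeholder.

```python
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

# Architecture values are taken from "Model Description" above; everything else
# is a library default or placeholder, not a value reported in the paper.
config = BertConfig(vocab_size=30_000, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12)
model = BertForMaskedLM(config)  # randomly initialized, i.e. trained from scratch

# A WordPiece tokenizer trained on the normalized MeMo corpus is assumed to
# exist at this placeholder path.
tokenizer = BertTokenizerFast.from_pretrained("path/to/memo-tokenizer")

# Dynamic masking with the 15% MLM probability stated above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
```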

Evaluation

Benchmarks

| Task | Dataset | Test F1 | Notes |
|------|---------|---------|-------|
| Sentiment Analysis | MiMe-MeMo/Sentiment-v1 | 0.56 | 3-class (positive / negative / neutral) |
| Word Sense Disambiguation | MiMe-MeMo/WSD-Skaebne | 0.43 | 4-class (pre-modern, modern, figure of speech, ambiguous) |
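
Continuing the fine-tuning sketch from "Intended Use", a test-set F1 in the spirit of the table above can be computed as follows; the averaging method ("macro" here) is an assumption, since the table does not state which F1 average is reported.

```python
import numpy as np
from sklearn.metrics import f1_score

# `trainer` and `dataset` come from the fine-tuning sketch under "Intended Use".
predictions = trainer.predict(dataset["test"])
y_pred = np.argmax(predictions.predictions, axis=-1)
y_true = predictions.label_ids

# Macro averaging is assumed; the benchmark table does not specify it.
print("test F1:", f1_score(y_true, y_pred, average="macro"))
```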

Comparison

MeMo-BERT-01 performs worse than MeMo-BERT-03 (continued pre-training), highlighting the limitations of training from scratch on historical data without leveraging contemporary PLMs.

Limitations

  • Trained from scratch on only ~53M words, a relatively small corpus for BERT pre-training.
  • Underperforms compared to continued pre-training (MeMo-BERT-03).
  • Domain-specific to late 19th-century novels.
  • Residual OCR and spelling-normalization errors may remain in the training corpus.

Ethical Considerations

  • All texts are in the public domain (their authors are deceased).
  • Datasets released under CC BY 4.0.
  • No sensitive personal data involved.

Citation

If you use this model, please cite:

@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    pages = "4811--4819",
    url = "https://aclanthology.org/2024.lrec-main.431/"
}