---
language:
- da
- no
license: cc-by-4.0
datasets:
- MiMe-MeMo/Corpus-v1.1
- MiMe-MeMo/Sentiment-v1
- MiMe-MeMo/WSD-Skaebne
metrics:
- f1
tags:
- historical-texts
- digital-humanities
- sentiment-analysis
- word-sense-disambiguation
- danish
- norwegian
model-index:
- name: MeMo-BERT-03
  results:
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      name: MiMe-MeMo/Sentiment-v1
      type: text
    metrics:
    - name: f1
      type: f1
      value: 0.77
  - task:
      type: text-classification
      name: Word Sense Disambiguation
    dataset:
      name: MiMe-MeMo/WSD-Skaebne
      type: text
    metrics:
    - name: f1
      type: f1
      value: 0.61
---

# MeMo-BERT-03

**MeMo-BERT-03** is a pre-trained language model for **historical Danish and Norwegian literary texts** (1870–1900). It was introduced in [Al-Laith et al. (2024)](https://aclanthology.org/2024.lrec-main.431/) as one of the first dedicated PLMs for historical Danish and Norwegian.

## Model Description

- **Architecture:** XLM-RoBERTa-base (12 layers, hidden size 768, 12 attention heads, 250k vocabulary)
- **Pre-training strategy:** Continued pre-training of [DanskBERT](https://huggingface.co/vesteinn/DanskBERT) on historical data
- **Training objective:** Masked language modeling (MLM, 15% masking)
- **Training data:** MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
- **Hardware:** 2 × A100 GPUs
- **Training time:** ~32 hours

## Intended Use

- **Primary tasks:**
  - Sentiment analysis (positive, neutral, negative)
  - Word sense disambiguation (historical vs. modern senses of *skæbne*, "fate")
- **Intended users:**
  - Researchers in digital humanities, computational linguistics, and Scandinavian studies
  - Historians of literature studying 19th-century Scandinavian novels
- **Not intended for:**
  - Contemporary Danish/Norwegian NLP tasks (performance may degrade)
  - High-stakes applications (e.g., legal, medical, or political decision-making)

## Training Data

- **Corpus:** [MeMo Corpus v1.1](https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1) (Bjerring-Hansen et al., 2022)
- **Time period:** 1870–1900
- **Size:** 839 novels, 690 MB, 3.2M sentences, 52.7M words
- **Preprocessing:** OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, and annotated

## Evaluation

### Benchmarks

| Task | Dataset | Test F1 | Notes |
|------|---------|---------|-------|
| Sentiment analysis | MiMe-MeMo/Sentiment-v1 | **0.77** | 3-class (positive/negative/neutral) |
| Word sense disambiguation | MiMe-MeMo/WSD-Skaebne | **0.61** | 4-class (pre-modern, modern, figure of speech, ambiguous) |

### Comparison

MeMo-BERT-03 outperforms MeMo-BERT-1, MeMo-BERT-2, and contemporary baselines (DanskBERT, ScandiBERT, DanBERT, BotXO) on both tasks.

## Limitations

- Domain-specific: trained only on **novels from 1870–1900**.
- May not generalize to other genres (newspapers, folk tales, poetry).
- Evaluation datasets are relatively small.
- OCR/normalization errors remain in some texts.

## Ethical Considerations

- All texts are in the **public domain** (authors deceased).
- Datasets are released under **CC BY 4.0**.
- Word sense annotations were created by literary scholars; no sensitive personal data is involved.
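## How to Use

Since the model was trained with a masked language modeling objective, it can be queried directly with the `fill-mask` pipeline. The snippet below is a minimal sketch: the hub ID `MiMe-MeMo/MeMo-BERT-03` is an assumption about where the checkpoint is hosted, and the `<mask>` token follows the XLM-RoBERTa tokenizer convention inherited from DanskBERT.

```python
from transformers import pipeline

# "MiMe-MeMo/MeMo-BERT-03" is a hypothetical hub ID; point this at
# wherever the checkpoint is actually hosted.
fill_mask = pipeline("fill-mask", model="MiMe-MeMo/MeMo-BERT-03")

# XLM-RoBERTa-style tokenizers use <mask> as the mask token.
# Danish: "She brooded over her <mask>."
print(fill_mask("Hun grublede over sin <mask>."))
```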
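Because the released checkpoint is a masked LM, downstream tasks such as the sentiment benchmark above require a fine-tuning step. The following is a sketch using the standard `transformers` `Trainer`, not the authors' exact training setup; the `text`/`label` column names and the `train` split are assumptions about the Sentiment-v1 schema.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "MiMe-MeMo/MeMo-BERT-03"  # hypothetical hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=3  # positive / neutral / negative
)

# The "text"/"label" columns and the "train" split are assumptions
# about the dataset schema; adjust to the actual Sentiment-v1 layout.
dataset = load_dataset("MiMe-MeMo/Sentiment-v1")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="memo-bert-03-sentiment",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```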
## Citation

If you use this model, please cite:

```bibtex
@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    pages = "4811--4819",
    url = "https://aclanthology.org/2024.lrec-main.431/"
}
```