yemen2016 committed
Commit 988424f · verified · 1 Parent(s): c1492c3

Update readme file

  1. README.md +117 -9
README.md CHANGED
@@ -1,11 +1,119 @@
  ---
- license: apache-2.0
- datasets:
- - MiMe-MeMo/Corpus-v1.1
  language:
- - da
- - 'no'
- base_model:
- - vesteinn/DanskBERT
- pipeline_tag: fill-mask
- ---
  ---
  language:
+ - da
+ - 'no'
+ license: cc-by-4.0
+ datasets:
+ - MiMe-MeMo/Corpus-v1.1
+ - MiMe-MeMo/Sentiment-v1
+ - MiMe-MeMo/WSD-Skaebne
+ metrics:
+ - f1
+ tags:
+ - historical-texts
+ - digital-humanities
+ - sentiment-analysis
+ - word-sense-disambiguation
+ - danish
+ - norwegian
+ model-index:
+ - name: MeMo-BERT-03
+   results:
+   - task:
+       type: text-classification
+       name: Sentiment Analysis
+     dataset:
+       name: MiMe-MeMo/Sentiment-v1
+       type: text
+     metrics:
+     - name: f1
+       type: f1
+       value: 0.77
+   - task:
+       type: text-classification
+       name: Word Sense Disambiguation
+     dataset:
+       name: MiMe-MeMo/WSD-Skaebne
+       type: text
+     metrics:
+     - name: f1
+       type: f1
+       value: 0.61
+ ---
+
+ # MeMo-BERT-03
+
+ **MeMo-BERT-03** is a pre-trained language model for **historical Danish and Norwegian literary texts** (1870–1900).
+ It was introduced in [Al-Laith et al. (2024)](https://aclanthology.org/2024.lrec-main.431/) as one of the first dedicated pre-trained language models (PLMs) for historical Danish and Norwegian.
+
+ ## Model Description
+
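+ - **Architecture:** XLM-RoBERTa-base (card lists 24 layers, 1024 hidden size, 16 heads, vocab size 250k; the layer, hidden-size, and head counts match XLM-RoBERTa-large, while the base configuration has 12 layers, 768 hidden size, 12 heads)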
+ - **Pre-training strategy:** Continued pre-training of [DanskBERT](https://huggingface.co/vesteinn/DanskBERT) on historical data
+ - **Training objective:** Masked Language Modeling (MLM, 15% masking)
+ - **Training data:** MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
+ - **Hardware:** 2 × A100 GPUs
+ - **Training time:** ~32 hours
+
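The 15% masking rate in the training objective can be illustrated with a minimal sketch. This is not the training code behind the model: the `<mask>` string, the `mask_tokens` helper, and whitespace tokenization are assumptions made for illustration, and real RoBERTa-style training works on subword tokens and typically also replaces some selected tokens with random or unchanged tokens.

```python
import random

MASK_TOKEN = "<mask>"  # assumed RoBERTa-style mask string, for illustration only


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask roughly `mask_prob` of the tokens.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None everywhere else.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so short inputs still yield a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    sentence = "hun troede at hendes skæbne var bestemt længe før hun blev født".split()
    masked, labels = mask_tokens(sentence)
    print(masked)
    print(labels)
```

The model is trained to predict the original token at each masked position; positions whose label is `None` do not contribute to the loss.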
+ ## Intended Use
+
+ - **Primary tasks:**
+   - Sentiment Analysis (positive, neutral, negative)
+   - Word Sense Disambiguation (historical vs. modern senses of *skæbne*, "fate")
+
+ - **Intended users:**
+   - Researchers in Digital Humanities, Computational Linguistics, and Scandinavian Studies.
+   - Historians of literature studying 19th-century Scandinavian novels.
+
+ - **Not intended for:**
+   - Contemporary Danish/Norwegian NLP tasks (performance may degrade).
+   - High-stakes applications (e.g., legal, medical, political decision-making).
+
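For the sentiment task listed above, a hedged sketch of how the encoder might be loaded for fine-tuning with the `transformers` library follows. The repository id `MiMe-MeMo/MeMo-BERT-03` and the label order are assumptions (the card does not state either), and `load_for_sentiment` is a hypothetical helper name.

```python
# Assumed Hub repository id; the card names the model "MeMo-BERT-03"
# but does not spell out its full path.
MODEL_ID = "MiMe-MeMo/MeMo-BERT-03"

# Label order is an assumption; check the Sentiment-v1 dataset for the real mapping.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def load_for_sentiment(model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a fresh 3-class head.

    Requires `transformers` and `torch`, plus network access to the Hub.
    """
    # Imported lazily so the constants above work without the library installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

The classification head created this way is randomly initialized, so its predictions are only meaningful after fine-tuning on labeled data such as the sentiment dataset named in the card.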
+ ## Training Data
+
+ - **Corpus:** [MeMo Corpus v1.1](https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1) (Bjerring-Hansen et al. 2022)
+ - **Time period:** 1870–1900
+ - **Size:** 839 novels, 690 MB, 3.2M sentences, 52.7M words
+ - **Preprocessing:** OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, annotated
+
+ ## Evaluation
+
+ ### Benchmarks
+
+ | Task | Dataset | Test F1 | Notes |
+ |------|---------|---------|-------|
+ | Sentiment Analysis | MiMe-MeMo/Sentiment-v1 | **0.77** | 3-class (pos/neg/neu) |
+ | Word Sense Disambiguation | MiMe-MeMo/WSD-Skaebne | **0.61** | 4-class (pre-modern, modern, figure of speech, ambiguous) |
+
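The table reports F1 without saying how it is averaged across classes. As a minimal sketch, assuming macro averaging (an assumption, not something the card confirms), a multi-class F1 can be computed as follows; `macro_f1` is a hypothetical helper, not part of any released evaluation script.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    gold = ["pos", "neg", "neu", "pos"]
    pred = ["pos", "neg", "pos", "neu"]
    print(round(macro_f1(gold, pred), 3))
```

Macro averaging gives every class equal weight, which matters for small or imbalanced evaluation sets; weighted or micro averaging would give different numbers on the same predictions.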
+ ### Comparison
+
+ MeMo-BERT-03 outperforms MeMo-BERT-1, MeMo-BERT-2, and contemporary baselines (DanskBERT, ScandiBERT, DanBERT, BotXO) across both tasks.
+
+ ## Limitations
+
+ - Domain-specific: trained only on **novels from 1870–1900**.
+ - May not generalize to other genres (newspapers, folk tales, poetry).
+ - Evaluation datasets are relatively small.
+ - OCR/normalization errors remain in some texts.
+
+ ## Ethical Considerations
+
+ - All texts are **public domain** (authors deceased).
+ - Datasets are released under **CC BY 4.0**.
+ - Word sense annotations were created by literary scholars and contain no sensitive personal data.
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @inproceedings{al-laith-etal-2024-development,
+ title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
+ author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
+ booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
+ year = "2024",
+ address = "Torino, Italia",
+ publisher = "ELRA and ICCL",
+ pages = "4811--4819",
+ url = "https://aclanthology.org/2024.lrec-main.431/"
+ }
+ ```