# MeMo-BERT-3
MeMo-BERT-3 is a Danish language model for natural language understanding tasks. It is part of the MeMo-BERT series and was built by continued pre-training on top of DanskBERT, a model grounded in large-scale contemporary Danish corpora, with the aim of improving language representations for 19th-century Danish and Norwegian literary texts (Al-Laith et al., 2024).
## Model Description

- **Base model:** DanskBERT (Snæbjarnarson et al., 2023)
- **Original foundation:** XLM-RoBERTa (24 layers, hidden size 1024, 16 attention heads)
- **Tokenizer:** subword tokenizer with a vocabulary of 250,000 tokens
- **Language:** Danish 🇩🇰
- **Type:** Transformer-based masked language model (MLM)
- **Pre-training objective:** continued masked language modeling (MLM)
## Architecture

MeMo-BERT-3 inherits its architecture from XLM-RoBERTa-large:

- 24 Transformer encoder layers
- Hidden size: 1024
- Attention heads: 16
- Vocabulary size: 250,000 (subword-based)
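
If needed, these figures can be checked directly against the published checkpoint by inspecting its configuration; the expected values in the comments simply mirror the list above.

```python
from transformers import AutoConfig

# Inspect the published config to confirm the dimensions listed above.
config = AutoConfig.from_pretrained("MiMe-MeMo/MeMo-BERT-03")
print(config.num_hidden_layers)    # expected: 24
print(config.hidden_size)          # expected: 1024
print(config.num_attention_heads)  # expected: 16
print(config.vocab_size)           # expected: roughly 250,000
```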
## Pretraining Details

MeMo-BERT-3 continues masked-language-model pre-training from DanskBERT, which was itself further pre-trained from XLM-RoBERTa on the Danish Gigaword Corpus (Strømberg-Derczynski et al., 2021), a large corpus of contemporary Danish.
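
For context, continued MLM pre-training of this kind can be reproduced in outline with the Hugging Face `Trainer`. The sketch below is a minimal illustration, not the authors' training script; the DanskBERT checkpoint id, the corpus file, and the hyperparameters are assumptions.

```python
# Hypothetical sketch of continued MLM pre-training (illustrative only).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "vesteinn/DanskBERT"  # assumed Hub id of the DanskBERT starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# A plain-text corpus, one document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "danish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="memo-bert-3-continued",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```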
## Intended Use

MeMo-BERT-3 is intended for Danish NLP tasks such as:

- Sentiment analysis
- Text classification
- Question answering
- Token classification
- Masked language modeling (fill-mask prediction; see the sketch below)
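
As a minimal illustration of the masked-language-modeling use case, the checkpoint can be queried through the `fill-mask` pipeline (the example sentence is illustrative):

```python
from transformers import pipeline

# Masked-token prediction with the published checkpoint.
fill = pipeline("fill-mask", model="MiMe-MeMo/MeMo-BERT-03")
masked = f"Det danske sprog er {fill.tokenizer.mask_token} og nuanceret."

for prediction in fill(masked):
    print(prediction["token_str"], round(prediction["score"], 3))
```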
## How to Use

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("MiMe-MeMo/MeMo-BERT-03")
model = AutoModel.from_pretrained("MiMe-MeMo/MeMo-BERT-03")

text = "Det danske sprog er rigt og nuanceret."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, 1024)
```
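
If sentence-level features are needed, one common approach (not prescribed by the model card) is mean pooling over the token embeddings, continuing from the snippet above:

```python
# Mean-pool token embeddings into a single sentence vector (illustrative only;
# no pooling strategy is prescribed by the model card).
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (1, 1024)
sentence_embedding = summed / mask.sum(dim=1)           # average over real tokens
```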
## Citation

If you use MeMo-BERT-3 in your research, please cite the following work:
```bibtex
@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and
      Conroy, Alexander and
      Bjerring-Hansen, Jens and
      Hershcovich, Daniel",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.431/",
    pages = "4811--4819",
    abstract = "We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis."
}
```
## License
This model is released under the Apache 2.0 License. Please ensure compliance when using or modifying this model.