# MeMo-BERT-3
MeMo-BERT-3 is a Danish language model for natural language understanding tasks. It is part of the MeMo-BERT series and was built by continued pre-training on top of DanskBERT, a model grounded in large-scale contemporary Danish corpora, with the aim of improving language representations for 19th-century Danish and Norwegian literary texts (Al-Laith et al., 2024).
## Model Description

- **Base model:** DanskBERT (Snæbjarnarson et al., 2023)
- **Original foundation:** XLM-RoBERTa (24 layers, hidden size 1024, 16 attention heads)
- **Tokenizer:** subword tokenizer with a vocabulary of 250,000 tokens
- **Language:** Danish 🇩🇰
- **Type:** Transformer-based masked language model (MLM)
- **Pre-training objective:** continued masked language modeling (MLM)
## Architecture

MeMo-BERT-3 inherits its architecture from XLM-RoBERTa-large:

- 24 Transformer encoder layers
- Hidden size: 1024
- Attention heads: 16
- Vocabulary size: 250,000 (subword-based)
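
If needed, these figures can be checked directly against the published checkpoint by inspecting its configuration; the expected values in the comments simply mirror the list above.

```python
from transformers import AutoConfig

# Inspect the published config to confirm the dimensions listed above.
config = AutoConfig.from_pretrained("MiMe-MeMo/MeMo-BERT-03")
print(config.num_hidden_layers)    # expected: 24
print(config.hidden_size)          # expected: 1024
print(config.num_attention_heads)  # expected: 16
print(config.vocab_size)           # expected: roughly 250,000
```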
## Pretraining Details

MeMo-BERT-3 continues masked-language-model pre-training from DanskBERT, which was itself further pre-trained from XLM-RoBERTa on the Danish Gigaword Corpus (Strømberg-Derczynski et al., 2021), a large corpus of contemporary Danish.
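
For context, continued MLM pre-training of this kind can be reproduced in outline with the Hugging Face `Trainer`. The sketch below is a minimal illustration, not the authors' training script; the DanskBERT checkpoint id, the corpus file, and the hyperparameters are assumptions.

```python
# Hypothetical sketch of continued MLM pre-training (illustrative only).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "vesteinn/DanskBERT"  # assumed Hub id of the DanskBERT starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# A plain-text corpus, one document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "danish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="memo-bert-3-continued",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```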
## Intended Use

MeMo-BERT-3 is intended for Danish NLP tasks such as:

- Sentiment analysis
- Text classification
- Question answering
- Token classification
- Masked language modeling (fill-mask prediction; see the sketch below)
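
As a minimal illustration of the masked-language-modeling use case, the checkpoint can be queried through the `fill-mask` pipeline (the example sentence is illustrative):

```python
from transformers import pipeline

# Masked-token prediction with the published checkpoint.
fill = pipeline("fill-mask", model="MiMe-MeMo/MeMo-BERT-03")
masked = f"Det danske sprog er {fill.tokenizer.mask_token} og nuanceret."

for prediction in fill(masked):
    print(prediction["token_str"], round(prediction["score"], 3))
```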
## How to Use

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("MiMe-MeMo/MeMo-BERT-03")
model = AutoModel.from_pretrained("MiMe-MeMo/MeMo-BERT-03")

text = "Det danske sprog er rigt og nuanceret."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, 1024)
```
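
If sentence-level features are needed, one common approach (not prescribed by the model card) is mean pooling over the token embeddings, continuing from the snippet above:

```python
# Mean-pool token embeddings into a single sentence vector (illustrative only;
# no pooling strategy is prescribed by the model card).
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (1, 1024)
sentence_embedding = summed / mask.sum(dim=1)           # average over real tokens
```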
## Citation

If you use MeMo-BERT-3 in your research, please cite the following work:
```bibtex
@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and
      Conroy, Alexander and
      Bjerring-Hansen, Jens and
      Hershcovich, Daniel",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.431/",
    pages = "4811--4819",
    abstract = "We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis."
}
```
## License
This model is released under the Apache 2.0 License. Please ensure compliance when using or modifying this model.