---
library_name: transformers
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bm
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kg
- kk
- km
- kn
- ko
- ku
- ky
- la
- lg
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- th
- ti
- tl
- tn
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- zh
license: agpl-3.0
tags:
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
---

# Model Card for `impresso-project/nel-mgenre-multilingual`

The **Impresso multilingual named entity linking (NEL)** model is based on **mGENRE** (multilingual Generative ENtity REtrieval), proposed by [De Cao et al.](https://arxiv.org/abs/2103.12528), a sequence-to-sequence architecture for entity disambiguation built on [mBART](https://arxiv.org/abs/2001.08210). It uses **constrained generation** to output entity names that map to Wikidata QIDs.

This model was adapted for historical texts and fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which covers a variety of historical document types and languages.

## Model Details

### Model Description

- **Developed by:** EPFL, as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** mBART-based sequence-to-sequence model with constrained beam search for named entity linking
- **Languages:** Multilingual (100+ languages, optimized for French, German, and English)
- **License:** [AGPL v3+](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
- **Finetuned from:** [`facebook/mgenre-wiki`](https://huggingface.co/facebook/mgenre-wiki)

### Model Architecture

- **Architecture:** mBART-based seq2seq with constrained beam search

## Training Details

### Training Data

The model was fine-tuned on the following HIPE-2022 datasets:

| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---------------|--------|---------------|-----------|--------------|---------|---------|
| ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| hipe2020 | [link](documentation/README-hipe2020.md) | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020) | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL | [Living with Machines](https://livingwithmachines.ac.uk/) | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
| newseye | [link](documentation/README-newseye.md) | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) |

## How to Use

```python
from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

# Load the tokenizer that ships with the model.
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

# "generic-nel" is a custom pipeline defined in the model repository,
# so trust_remote_code=True is required to load it.
nel_pipeline = pipeline(
    "generic-nel",
    model=NEL_MODEL_NAME,
    tokenizer=nel_tokenizer,
    trust_remote_code=True,
    device='cpu',
)

# The mention to link is wrapped in [START] ... [END].
# The sample sentence contains OCR-style noise ("0ctobre", "Dreyfvs", "déch1ra").
sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."

print(nel_pipeline(sentence))
```

### Output Format

```python
[
    {
        'surface': 'Dreyfvs',
        'wkd_id': 'Q171826',
        'wkpedia_pagename': 'Alfred Dreyfus',
        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
        'type': 'UNK',
        'confidence_nel': 99.98,
        'lOffset': 24,
        'rOffset': 33
    }
]
```

The entity `type` is `UNK` because the model was not trained to predict entity types. The `confidence_nel` score indicates the model's confidence in the predicted link.

## Use Cases

- Entity disambiguation in noisy OCR settings
- Linking historical names to modern Wikidata entities
- Assisting downstream event extraction and biography generation from historical archives

## Limitations

- Sensitive to tokenisation and malformed spans
- Accuracy degrades on non-Wikidata entities or in highly ambiguous contexts
- Focused on historical entity mentions; performance may vary on modern texts

## Environmental Impact

- **Hardware:** 1x A100 (80GB) for finetuning
- **Training time:** ~12 hours
- **Estimated CO₂ Emissions:** ~2.3 kg CO₂eq

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)
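
## Example: Marking Mentions and Filtering Results

The usage example in this card wraps the mention in `[START]` and `[END]`, and the output carries a `confidence_nel` score. The sketch below shows one way to prepare such inputs from character offsets and to drop low-confidence links afterwards. The helper names `mark_mention` and `filter_links` are hypothetical and not part of the model repository, and the default threshold of `50.0` is an arbitrary assumption based on the example score (99.98) looking like a percentage.

```python
from __future__ import annotations

from typing import Any

# Marker strings as they appear in this card's usage example.
START, END = "[START]", "[END]"


def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in [START] ... [END] (hypothetical helper)."""
    if not 0 <= start < end <= len(text):
        raise ValueError(f"invalid span ({start}, {end}) for text of length {len(text)}")
    return f"{text[:start]}{START} {text[start:end]} {END}{text[end:]}"


def filter_links(results: list[dict[str, Any]], min_confidence: float = 50.0) -> list[dict[str, Any]]:
    """Keep predictions whose confidence_nel is at least min_confidence.

    Assumes confidence_nel is on a 0-100 scale, as the example output suggests.
    """
    return [r for r in results if r.get("confidence_nel", 0.0) >= min_confidence]


# Mark the first seven characters ("Dreyfus") as the mention to link.
marked = mark_mention("Dreyfus est arrêté à Paris.", 0, 7)
print(marked)  # [START] Dreyfus [END] est arrêté à Paris.

# Filter the example output from this card with a stricter threshold.
example_output = [{"surface": "Dreyfvs", "wkd_id": "Q171826", "confidence_nel": 99.98}]
print(filter_links(example_output, min_confidence=90.0))
```

The marked string can then be passed to `nel_pipeline` as in the usage example. A suitable threshold depends on the corpus and should be chosen by inspecting predictions on held-out data.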