---
library_name: transformers
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bm
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kg
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - ti
  - tl
  - tn
  - tr
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yo
  - zh
license: agpl-3.0
tags:
  - retrieval
  - entity-retrieval
  - named-entity-disambiguation
  - entity-disambiguation
  - named-entity-linking
  - entity-linking
  - text2text-generation
---

# Model Card for impresso-project/nel-mgenre-multilingual

The Impresso multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval) proposed by De Cao et al., a sequence-to-sequence entity-disambiguation architecture built on mBART. It uses constrained generation to output entity names that are mapped to Wikidata QIDs.

This model was adapted for historical texts and fine-tuned on the HIPE-2022 dataset, which includes a variety of historical document types and languages.
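
For orientation, the underlying seq2seq interface can be exercised directly on the base checkpoint. The sketch below uses plain beam search on `facebook/mgenre-wiki`; it does not include the constrained decoding or the QID mapping that the fine-tuned Impresso pipeline (shown in "How to Use") adds on top.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Base mGENRE checkpoint, i.e. the model this card's model was fine-tuned from.
tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki")

# The mention to disambiguate is wrapped in [START] ... [END] markers.
sentence = "Le 13 octobre 1894, [START] Dreyfus [END] est arrêté à Paris."
inputs = tokenizer(sentence, return_tensors="pt")

# Plain beam search; mGENRE emits candidates as "<Wikipedia title> >> <lang>".
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```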

## Model Details

### Model Description

- **Developed by:** EPFL, within the Impresso project, an interdisciplinary research project on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891)
- **Model type:** mBART-based sequence-to-sequence model with constrained beam search for named entity linking
- **Languages:** Multilingual (100+ languages, optimized for French, German, and English)
- **License:** AGPL v3+
- **Finetuned from:** facebook/mgenre-wiki

### Model Architecture

- **Architecture:** mBART-based seq2seq with constrained beam search (see the conceptual sketch below)
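
Constrained beam search restricts decoding so that every generated sequence is a valid entity name: at each step the decoder consults a prefix trie of candidate names and only considers tokens that extend some name. The sketch below illustrates the idea with word-level tokens and a tiny hand-picked candidate set; the actual model works on subword tokens over millions of Wikipedia titles, plugged into generation via Hugging Face's `prefix_allowed_tokens_fn` hook.

```python
# Conceptual sketch of the prefix trie behind constrained beam search.
# Word-level tokens and the candidate list are illustrative only.
candidates = [["Alfred", "Dreyfus"], ["Alfred", "Nobel"], ["Dreyfus", "affair"]]

trie = {}
for name in candidates:
    node = trie
    for token in name:
        node = node.setdefault(token, {})

def allowed_next_tokens(prefix):
    """Return the tokens that may legally follow `prefix` under the trie."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return []          # prefix is not part of any candidate name
    return list(node.keys())   # empty means a complete name was generated

print(allowed_next_tokens(["Alfred"]))  # ['Dreyfus', 'Nobel']
```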

## Training Details

### Training Data

The model was fine-tuned on the following datasets:

| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---|---|---|---|---|---|---|
| ajmc | link | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | AjMC | CC BY 4.0 |
| hipe2020 | link | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020 | CC BY-NC-SA 4.0 |
| topres19th | link | historical newspapers | en | NERC-Coarse, EL | Living with Machines | CC BY-NC-SA 4.0 |
| newseye | link | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye | CC BY 4.0 |
| sonar | link | historical newspapers | de | NERC-Coarse, EL | SoNAR | CC BY 4.0 |

## How to Use

```python
from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

nel_pipeline = pipeline(
    "generic-nel",
    model=NEL_MODEL_NAME,
    tokenizer=nel_tokenizer,
    trust_remote_code=True,
    device="cpu",
)

# The mention to link is wrapped in [START] ... [END] markers; the example
# deliberately contains OCR noise ("0ctobre", "Dreyfvs", "déch1ra", "fr4nçaise").
sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
print(nel_pipeline(sentence))
```

## Output Format

```python
[
    {
        'surface': 'Dreyfvs',
        'wkd_id': 'Q171826',
        'wkpedia_pagename': 'Alfred Dreyfus',
        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
        'type': 'UNK',
        'confidence_nel': 99.98,
        'lOffset': 24,
        'rOffset': 33,
    }
]
```

The entity `type` is `UNK` because the model was not trained to predict entity types. The `confidence_nel` score indicates the model's confidence in the linking prediction.
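
Downstream code will typically keep only confident links. A minimal filtering sketch over the pipeline output; the helper name and the threshold are illustrative, not part of the pipeline:

```python
def keep_confident_links(predictions, min_confidence=90.0):
    """Keep only entities linked with confidence_nel above the threshold.

    The 90.0 default is an illustrative cut-off, not a calibrated value.
    """
    return [p for p in predictions if p["confidence_nel"] >= min_confidence]

for link in keep_confident_links(nel_pipeline(sentence)):
    print(f"{link['surface']} -> {link['wkd_id']} ({link['wkpedia_pagename']})")
```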

## Use Cases

- Entity disambiguation in noisy OCR settings
- Linking historical names to modern Wikidata entities
- Assisting downstream event extraction and biography generation from historical archives

## Limitations

- Sensitive to tokenisation and malformed mention spans; inserting the `[START]`/`[END]` markers programmatically (see the sketch below) helps avoid span errors
- Accuracy degrades on entities absent from Wikidata or in highly ambiguous contexts
- Focused on historical entity mentions; performance may vary on modern texts
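
A minimal sketch of marker insertion from character offsets; the helper name and the offsets are illustrative:

```python
def mark_mention(text, start, end):
    """Wrap the character span [start, end) in the [START]/[END] markers
    expected by the pipeline."""
    if not (0 <= start < end <= len(text)):
        raise ValueError("mention span out of bounds")
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"

raw = "Le 13 octobre 1894, Dreyfus est arrêté à Paris."
print(mark_mention(raw, 20, 27))
# Le 13 octobre 1894, [START] Dreyfus [END] est arrêté à Paris.
```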

## Environmental Impact

- **Hardware:** 1× A100 (80GB) for fine-tuning
- **Training time:** ~12 hours
- **Estimated CO₂ emissions:** ~2.3 kg CO₂eq
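
For reference, the figure is consistent with a simple power × time × carbon-intensity estimate. The average draw and grid intensity below are assumed illustrative values, not measurements from the actual run:

```python
# Back-of-the-envelope check of the reported emissions estimate.
avg_power_kw = 0.4          # assumed average draw of one A100 (80GB)
hours = 12                  # reported fine-tuning time
grid_kgco2_per_kwh = 0.475  # assumed grid carbon intensity (global average)

energy_kwh = avg_power_kw * hours               # 4.8 kWh
emissions_kg = energy_kwh * grid_kgco2_per_kwh  # ≈ 2.3 kg CO2eq
print(f"≈ {emissions_kg:.1f} kg CO2eq")
```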

## Contact
