---
library_name: transformers
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bm
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kg
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - ti
  - tl
  - tn
  - tr
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yo
  - zh
license: agpl-3.0
tags:
  - retrieval
  - entity-retrieval
  - named-entity-disambiguation
  - entity-disambiguation
  - named-entity-linking
  - entity-linking
  - text2text-generation
---

# Model Card for impresso-project/nel-mgenre-multilingual

The Impresso multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval) proposed by De Cao et al., a sequence-to-sequence entity-disambiguation architecture built on mBART. It uses constrained generation to output entity names that are mapped to Wikidata QIDs.

This model was adapted for historical texts and fine-tuned on the HIPE-2022 dataset, which includes a variety of historical document types and languages.
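
For orientation, the underlying seq2seq interface can be exercised directly on the base checkpoint. The sketch below uses plain beam search on `facebook/mgenre-wiki`; it does not include the constrained decoding or the QID mapping that the fine-tuned Impresso pipeline (shown in "How to Use") adds on top.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Base mGENRE checkpoint, i.e. the model this card's model was fine-tuned from.
tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki")

# The mention to disambiguate is wrapped in [START] ... [END] markers.
sentence = "Le 13 octobre 1894, [START] Dreyfus [END] est arrêté à Paris."
inputs = tokenizer(sentence, return_tensors="pt")

# Plain beam search; mGENRE emits candidates as "<Wikipedia title> >> <lang>".
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```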

## Model Details

### Model Description

- **Developed by:** EPFL, within the Impresso project, an interdisciplinary research project on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891)
- **Model type:** mBART-based sequence-to-sequence model with constrained beam search for named entity linking
- **Languages:** Multilingual (100+ languages, optimized for French, German, and English)
- **License:** AGPL v3+
- **Finetuned from:** facebook/mgenre-wiki

### Model Architecture

- **Architecture:** mBART-based seq2seq with constrained beam search (see the conceptual sketch below)
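
Constrained beam search restricts decoding so that every generated sequence is a valid entity name: at each step the decoder consults a prefix trie of candidate names and only considers tokens that extend some name. The sketch below illustrates the idea with word-level tokens and a tiny hand-picked candidate set; the actual model works on subword tokens over millions of Wikipedia titles, plugged into generation via Hugging Face's `prefix_allowed_tokens_fn` hook.

```python
# Conceptual sketch of the prefix trie behind constrained beam search.
# Word-level tokens and the candidate list are illustrative only.
candidates = [["Alfred", "Dreyfus"], ["Alfred", "Nobel"], ["Dreyfus", "affair"]]

trie = {}
for name in candidates:
    node = trie
    for token in name:
        node = node.setdefault(token, {})

def allowed_next_tokens(prefix):
    """Return the tokens that may legally follow `prefix` under the trie."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return []          # prefix is not part of any candidate name
    return list(node.keys())   # empty means a complete name was generated

print(allowed_next_tokens(["Alfred"]))  # ['Dreyfus', 'Nobel']
```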

## Training Details

### Training Data

The model was fine-tuned on the following datasets:

| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---|---|---|---|---|---|---|
| ajmc | link | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | AjMC | CC BY 4.0 |
| hipe2020 | link | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020 | CC BY-NC-SA 4.0 |
| topres19th | link | historical newspapers | en | NERC-Coarse, EL | Living with Machines | CC BY-NC-SA 4.0 |
| newseye | link | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye | CC BY 4.0 |
| sonar | link | historical newspapers | de | NERC-Coarse, EL | SoNAR | CC BY 4.0 |

## How to Use

```python
from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

nel_pipeline = pipeline(
    "generic-nel",
    model=NEL_MODEL_NAME,
    tokenizer=nel_tokenizer,
    trust_remote_code=True,
    device="cpu",
)

# The mention to link is wrapped in [START] ... [END] markers; the example
# deliberately contains OCR noise ("0ctobre", "Dreyfvs", "déch1ra", "fr4nçaise").
sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
print(nel_pipeline(sentence))
```

## Output Format

```python
[
    {
        'surface': 'Dreyfvs',
        'wkd_id': 'Q171826',
        'wkpedia_pagename': 'Alfred Dreyfus',
        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
        'type': 'UNK',
        'confidence_nel': 99.98,
        'lOffset': 24,
        'rOffset': 33,
    }
]
```

The entity `type` is `UNK` because the model was not trained to predict entity types. The `confidence_nel` score indicates the model's confidence in the linking prediction.
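
Downstream code will typically keep only confident links. A minimal filtering sketch over the pipeline output; the helper name and the threshold are illustrative, not part of the pipeline:

```python
def keep_confident_links(predictions, min_confidence=90.0):
    """Keep only entities linked with confidence_nel above the threshold.

    The 90.0 default is an illustrative cut-off, not a calibrated value.
    """
    return [p for p in predictions if p["confidence_nel"] >= min_confidence]

for link in keep_confident_links(nel_pipeline(sentence)):
    print(f"{link['surface']} -> {link['wkd_id']} ({link['wkpedia_pagename']})")
```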

## Use Cases

- Entity disambiguation in noisy OCR settings
- Linking historical names to modern Wikidata entities
- Assisting downstream event extraction and biography generation from historical archives

## Limitations

- Sensitive to tokenisation and malformed mention spans; inserting the `[START]`/`[END]` markers programmatically (see the sketch below) helps avoid span errors
- Accuracy degrades on entities absent from Wikidata or in highly ambiguous contexts
- Focused on historical entity mentions; performance may vary on modern texts
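
A minimal sketch of marker insertion from character offsets; the helper name and the offsets are illustrative:

```python
def mark_mention(text, start, end):
    """Wrap the character span [start, end) in the [START]/[END] markers
    expected by the pipeline."""
    if not (0 <= start < end <= len(text)):
        raise ValueError("mention span out of bounds")
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"

raw = "Le 13 octobre 1894, Dreyfus est arrêté à Paris."
print(mark_mention(raw, 20, 27))
# Le 13 octobre 1894, [START] Dreyfus [END] est arrêté à Paris.
```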

## Environmental Impact

- **Hardware:** 1× A100 (80GB) for fine-tuning
- **Training time:** ~12 hours
- **Estimated CO₂ emissions:** ~2.3 kg CO₂eq
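
For reference, the figure is consistent with a simple power × time × carbon-intensity estimate. The average draw and grid intensity below are assumed illustrative values, not measurements from the actual run:

```python
# Back-of-the-envelope check of the reported emissions estimate.
avg_power_kw = 0.4          # assumed average draw of one A100 (80GB)
hours = 12                  # reported fine-tuning time
grid_kgco2_per_kwh = 0.475  # assumed grid carbon intensity (global average)

energy_kwh = avg_power_kw * hours               # 4.8 kWh
emissions_kg = energy_kwh * grid_kgco2_per_kwh  # ≈ 2.3 kg CO2eq
print(f"≈ {emissions_kg:.1f} kg CO2eq")
```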

## Contact
