
@@ -1,5 +1,5 @@
 ---
 language:
 - multilingual
 - af
@@ -108,7 +108,7 @@ language:
 - yo
 - zh
 tags:
 - retrieval
 - entity-retrieval
@@ -119,14 +119,24 @@ tags:
 - text2text-generation
 ---
-# mGENRE
-The historical multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval) system as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on finetuned [mBART](https://arxiv.org/abs/2001.08210) architecture.
-GENRE performs retrieval generating the unique entity name conditioned on the input text using constrained beam search to only generate valid identifiers.
-This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), composed of the following datasets.
 | Dataset alias | README | Document type | Languages |  Suitable for | Project | License |
 |---------|---------|---------------|-----------| ---------------|---------------| ---------------|
@@ -137,43 +147,62 @@ This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval
 | sonar      | [link](documentation/README-sonar.md) | historical newspapers  | de | NERC-Coarse, EL |  [SoNAR](https://sonar.fh-potsdam.de/)  | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|
-## BibTeX entry and citation info
-## Usage
-Here is an example of generation for Wikipedia page disambiguation with simulated OCR noise:
 ```python
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
-from transformers import pipeline
 NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
-# Load the tokenizer and model from the specified pre-trained model name
-# The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
-nel_tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
-sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
-             "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
-             "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]
-nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
-                        tokenizer=nel_tokenizer,
                         trust_remote_code=True,
                         device='cpu')
-for sentence in sentences:
-    print(sentence)
-    linked_entity = nel_pipeline(sentence)
-    print(linked_entity)
-```
 ```
-[{'surface': 'Un1ted Press', 'wkd_id': 'Q493845', 'wkpedia_pagename': 'United Press International', 'wkpedia_url': 'https://en.wikipedia.org/wiki/United_Press_International', 'type': 'UNK', 'confidence_nel': 55.89, 'lOffset': 7, 'rOffset': 21}]
-[{'surface': 'Lon6on', 'wkd_id': 'Q84', 'wkpedia_pagename': 'London', 'wkpedia_url': 'https://de.wikipedia.org/wiki/London', 'type': 'UNK', 'confidence_nel': 99.99, 'lOffset': 10, 'rOffset': 18}]
-[{'surface': 'AFP', 'wkd_id': 'Q40464', 'wkpedia_pagename': 'Agence France-Presse', 'wkpedia_url': 'https://fr.wikipedia.org/wiki/Agence_France-Presse', 'type': 'UNK', 'confidence_nel': 100.0, 'lOffset': 45, 'rOffset': 50}]
 ```
----
-license: agpl-3.0
----

 ---
+library_name: transformers
 language:
 - multilingual
 - af
 - yo
 - zh
+license: agpl-3.0
 tags:
 - retrieval
 - entity-retrieval
 - text2text-generation
 ---
+# Model Card for `impresso-project/nel-mgenre-multilingual`
+The **Impresso multilingual named entity linking (NEL)** model is based on **mGENRE** (multilingual Generative ENtity REtrieval), proposed by [De Cao et al.](https://arxiv.org/abs/2103.12528), a sequence-to-sequence architecture for entity disambiguation built on [mBART](https://arxiv.org/abs/2001.08210). It uses **constrained generation** to output only valid entity names, which are then mapped to Wikidata QIDs.
+This model was adapted for historical texts and fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which includes a variety of historical document types and languages.
+## Model Details
+- **Architecture:** mBART-based seq2seq with constrained beam search
+- **Languages supported:** multilingual (over 100 languages, optimized for fr, de, en)
+- **Training dataset:** HIPE-2022 (see below)
+- **Entity target space:** Wikidata entities
+- **Developed by:** DHLAB, EPFL
+- **License:** AGPL-3.0
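The constrained beam search mentioned above restricts each decoding step to tokens that can still complete a valid entity name. The following is a minimal toy sketch of that idea, not the model's actual implementation: the candidate names, the whitespace "tokenizer", and the helper names `build_trie` and `allowed_next_tokens` are all assumptions made for illustration, whereas the real model operates on subword IDs.

```python
# Toy illustration of prefix-constrained decoding with a trie.
# Candidate names and whitespace tokenization are illustrative assumptions.

def build_trie(names):
    """Build a nested-dict prefix trie over whitespace-separated tokens."""
    trie = {}
    for name in names:
        node = trie
        for token in name.split() + ["<eos>"]:
            node = node.setdefault(token, {})
    return trie


def allowed_next_tokens(trie, prefix):
    """Return the tokens that may follow `prefix`; empty if the prefix is invalid."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(node)


candidates = ["Alfred Dreyfus", "Alfred Nobel", "Agence France-Presse"]
trie = build_trie(candidates)

print(allowed_next_tokens(trie, []))                     # ['Agence', 'Alfred']
print(allowed_next_tokens(trie, ["Alfred"]))             # ['Dreyfus', 'Nobel']
print(allowed_next_tokens(trie, ["Alfred", "Dreyfus"]))  # ['<eos>']
```

In Hugging Face `transformers`, a callback of a similar shape can be passed to `generate` through its `prefix_allowed_tokens_fn` argument; whether this model's custom pipeline does exactly that is not stated in this card.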
+## Training Dataset
+The model was fine-tuned on the following HIPE-2022 datasets:
 | Dataset alias | README | Document type | Languages |  Suitable for | Project | License |
 |---------|---------|---------------|-----------| ---------------|---------------| ---------------|
 | sonar      | [link](documentation/README-sonar.md) | historical newspapers  | de | NERC-Coarse, EL |  [SoNAR](https://sonar.fh-potsdam.de/)  | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|
+## How to Use
 ```python
+from transformers import AutoTokenizer, pipeline
 NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
+nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)
+nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
+                        tokenizer=nel_tokenizer,
                         trust_remote_code=True,
                         device='cpu')
+# Wrap the mention to link in [START] ... [END]; the surrounding text serves as context.
+sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
+print(nel_pipeline(sentence))
 ```
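The input in the example above marks the mention to link with `[START]` and `[END]`. When mentions come from an upstream NER step as character spans, a small helper can insert the markers. This is a sketch under that assumption; `mark_mention` is a hypothetical helper, not part of the model's code, and the shortened sentence is for illustration only.

```python
def mark_mention(text, start, end):
    """Wrap text[start:end] in the [START] / [END] markers used in the example input."""
    if not 0 <= start < end <= len(text):
        raise ValueError("span must satisfy 0 <= start < end <= len(text)")
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


raw = "Le 0ctobre 1894, Dreyfvs est arrêté à Paris."
start = raw.index("Dreyfvs")
marked = mark_mention(raw, start, start + len("Dreyfvs"))
print(marked)
# Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris.
```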
+### Output Format
+```python
+[
+    {
+        'surface': 'Dreyfvs',
+        'wkd_id': 'Q171826',
+        'wkpedia_pagename': 'Alfred Dreyfus',
+        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
+        'type': 'UNK',
+        'confidence_nel': 99.98,
+        'lOffset': 24,
+        'rOffset': 33
+    }
+]
 ```
+The `type` field is `UNK` because the model was not trained to predict entity types. The `confidence_nel` score indicates the model's confidence in the prediction; in the sample output it appears to be on a 0 to 100 scale.
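A minimal sketch of post-processing this output follows. It keeps only confident links and recovers the marked text from the offsets. The helper names and the threshold of 90.0 are arbitrary assumptions. Reading `lOffset` and `rOffset` as character offsets that bound the text between the markers is inferred from the sample values, not from documented behaviour. The entries are trimmed copies of sample outputs, the second taken from the earlier README example.

```python
def confident_links(results, min_confidence=90.0):
    """Keep predictions whose confidence_nel meets the (arbitrary) threshold."""
    return [r for r in results if r["confidence_nel"] >= min_confidence]


def marked_span(sentence, result):
    """Recover the marked mention, assuming offsets bound the text between the markers."""
    return sentence[result["lOffset"]:result["rOffset"]].strip()


sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris."
results = [
    {"surface": "Dreyfvs", "wkd_id": "Q171826",
     "confidence_nel": 99.98, "lOffset": 24, "rOffset": 33},
    {"surface": "Un1ted Press", "wkd_id": "Q493845",
     "confidence_nel": 55.89, "lOffset": 7, "rOffset": 21},
]

kept = confident_links(results)
print([r["wkd_id"] for r in kept])        # ['Q171826']
print(marked_span(sentence, results[0]))  # Dreyfvs
```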
+## Use Cases
+- Entity disambiguation in noisy OCR settings
+- Linking historical names to modern Wikidata entities
+- Assisting downstream event extraction and biography generation from historical archives
+## Limitations
+- Sensitive to tokenization and malformed spans
+- Accuracy degrades on non-Wikidata entities or in highly ambiguous contexts
+- Focused on historical entity mentions; performance may vary on modern texts
+## Environmental Impact
+- **Hardware:** 1× A100 (80 GB) for fine-tuning
+- **Training time:** ~12 hours
+- **Estimated CO₂ Emissions:** ~2.3 kg CO₂eq
+## Contact
+- Website: [https://impresso-project.ch](https://impresso-project.ch)
+<p align="center">
+  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+</p>