	---
	library_name: transformers
	language:
	- multilingual
	- af
	- am
	- ar
	- as
	- az
	- be
	- bg
	- bm
	- bn
	- br
	- bs
	- ca
	- cs
	- cy
	- da
	- de
	- el
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- ff
	- fi
	- fr
	- fy
	- ga
	- gd
	- gl
	- gn
	- gu
	- ha
	- he
	- hi
	- hr
	- ht
	- hu
	- hy
	- id
	- ig
	- is
	- it
	- ja
	- jv
	- ka
	- kg
	- kk
	- km
	- kn
	- ko
	- ku
	- ky
	- la
	- lg
	- ln
	- lo
	- lt
	- lv
	- mg
	- mk
	- ml
	- mn
	- mr
	- ms
	- my
	- ne
	- nl
	- no
	- om
	- or
	- pa
	- pl
	- ps
	- pt
	- qu
	- ro
	- ru
	- sa
	- sd
	- si
	- sk
	- sl
	- so
	- sq
	- sr
	- ss
	- su
	- sv
	- sw
	- ta
	- te
	- th
	- ti
	- tl
	- tn
	- tr
	- uk
	- ur
	- uz
	- vi
	- wo
	- xh
	- yo
	- zh

	license: agpl-3.0
	tags:
	- retrieval
	- entity-retrieval
	- named-entity-disambiguation
	- entity-disambiguation
	- named-entity-linking
	- entity-linking
	- text2text-generation
	---

	# Model Card for `impresso-project/nel-mgenre-multilingual`

	The Impresso multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval), proposed by [De Cao et al.](https://arxiv.org/abs/2103.12528), a sequence-to-sequence architecture for entity disambiguation built on [mBART](https://arxiv.org/abs/2001.08210). It uses constrained generation to output entity names, which are then mapped to Wikidata QIDs.

	This model was adapted for historical texts and fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which includes a variety of historical document types and languages.

	## Model Details

	### Model Description

	- Developed by: EPFL, as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
	- Model type: mBART-based sequence-to-sequence model with constrained beam search for named entity linking
	- Languages: Multilingual (100+ languages, optimized for French, German, and English)
	- License: [AGPL v3+](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
	- Finetuned from: [`facebook/mgenre-wiki`](https://huggingface.co/facebook/mgenre-wiki)
	### Model Architecture

	- Architecture: mBART-based seq2seq with constrained beam search
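Constrained beam search restricts each decoding step to tokens that keep the output a prefix of some valid entity name. The sketch below illustrates the idea with a prefix tree (trie) over token-id sequences. It is a minimal illustration, not this model's actual implementation: the `Trie` class, the toy token ids, and the candidate names are all assumptions made for the example. The callback has the `(batch_id, input_ids)` shape that the `prefix_allowed_tokens_fn` argument of `generate` in `transformers` expects.

```python
from typing import Dict, List


class Trie:
    """Prefix tree over token-id sequences (one sequence per candidate entity name)."""

    def __init__(self, sequences: List[List[int]]):
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def next_tokens(self, prefix: List[int]) -> List[int]:
        """Return the token ids that may follow `prefix` (empty if the prefix is unknown)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


# Toy vocabulary: these ids are made up for illustration, not real mBART token ids.
EOS = 2
candidates = [
    [10, 11, EOS],  # stands in for "Alfred Dreyfus"
    [10, 12, EOS],  # stands in for "Alfred Nobel"
    [20, EOS],      # stands in for "Paris"
]
trie = Trie(candidates)


def prefix_allowed_tokens_fn(batch_id: int, input_ids) -> List[int]:
    """Allowed next tokens for one beam; accepts a list or a 1-D tensor of ids."""
    ids = input_ids.tolist() if hasattr(input_ids, "tolist") else list(input_ids)
    return trie.next_tokens(ids)


print(trie.next_tokens([]))        # [10, 20]
print(trie.next_tokens([10]))      # [11, 12]
print(trie.next_tokens([10, 11]))  # [2]
```

In a real setup the sequences stored in the trie would also need to begin with whatever tokens the decoder emits first, since the callback receives the full decoder prefix. That detail is omitted here.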

	## Training Details

	### Training Data

	The model was trained on the following datasets:

	| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
	|---|---|---|---|---|---|---|
	| ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
	| hipe2020 | [link](documentation/README-hipe2020.md) | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
	| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL | [Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
	| newseye | [link](documentation/README-newseye.md) | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
	| sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |


	## How to Use

	```python
	from transformers import AutoTokenizer, pipeline

	NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
	nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

	# "generic-nel" is a custom pipeline shipped with the model repo, hence trust_remote_code=True.
	nel_pipeline = pipeline(
	    "generic-nel",
	    model=NEL_MODEL_NAME,
	    tokenizer=nel_tokenizer,
	    trust_remote_code=True,
	    device="cpu",
	)

	sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
	print(nel_pipeline(sentence))
	```
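The input sentence above marks the mention to link with `[START]` and `[END]`. When mentions come from an upstream NER step as character offsets, the markers have to be inserted first. The helper below is a small sketch of that step. `mark_mention` is a hypothetical name, not part of the model's code, and the single-space padding around the mention simply mirrors the example sentence.

```python
def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in the [START] / [END] markers used in the example above."""
    if not 0 <= start < end <= len(text):
        raise ValueError(f"invalid span ({start}, {end}) for text of length {len(text)}")
    mention = text[start:end]
    if mention != mention.strip():
        raise ValueError("span should not begin or end with whitespace")
    return f"{text[:start]}[START] {mention} [END]{text[end:]}"


raw = "Le 0ctobre 1894, Dreyfvs est arrêté à Paris."
start = raw.index("Dreyfvs")
marked = mark_mention(raw, start, start + len("Dreyfvs"))
print(marked)
# Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris.
```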

	### Output Format

	```python
	[
	    {
	        'surface': 'Dreyfvs',
	        'wkd_id': 'Q171826',
	        'wkpedia_pagename': 'Alfred Dreyfus',
	        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
	        'type': 'UNK',
	        'confidence_nel': 99.98,
	        'lOffset': 24,
	        'rOffset': 33
	    }
	]
	```
	The entity `type` is `UNK` because the model was not trained to predict entity types. The `confidence_nel` score indicates the model's confidence in the predicted link.
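Downstream code will usually want to keep only confident links and resolve each QID to a Wikidata page. The sketch below shows one way to do that with the sample prediction above. The function names and the threshold of 90 are illustrative assumptions, and it assumes `confidence_nel` is on a 0-100 scale, as the sample value suggests.

```python
from typing import Dict, List

# Sample prediction copied from the output shown above.
predictions: List[Dict] = [
    {
        "surface": "Dreyfvs",
        "wkd_id": "Q171826",
        "wkpedia_pagename": "Alfred Dreyfus",
        "wkpedia_url": "https://fr.wikipedia.org/wiki/Alfred_Dreyfus",
        "type": "UNK",
        "confidence_nel": 99.98,
        "lOffset": 24,
        "rOffset": 33,
    }
]


def wikidata_url(qid: str) -> str:
    """Build the Wikidata entity page URL for a QID such as 'Q171826'."""
    return f"https://www.wikidata.org/wiki/{qid}"


def confident_links(preds: List[Dict], min_confidence: float = 90.0) -> List[Dict]:
    """Keep predictions at or above `min_confidence` (hypothetical threshold)."""
    kept = []
    for p in preds:
        if p["confidence_nel"] >= min_confidence:
            kept.append(
                {"surface": p["surface"], "qid": p["wkd_id"], "url": wikidata_url(p["wkd_id"])}
            )
    return kept


for link in confident_links(predictions):
    print(link["surface"], "->", link["qid"], link["url"])
# Dreyfvs -> Q171826 https://www.wikidata.org/wiki/Q171826
```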

	## Use Cases

	- Entity disambiguation in noisy OCR settings
	- Linking historical names to modern Wikidata entities
	- Assisting downstream event extraction and biography generation from historical archives

	## Limitations

	- Sensitive to tokenisation and malformed spans
	- Accuracy degrades on non-Wikidata entities or in highly ambiguous contexts
	- Focused on historical entity mentions; performance may vary on modern texts
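Because malformed spans are a known weak point, it can help to check the markers before calling the pipeline. The check below is a sketch only. It assumes exactly one `[START] … [END]` pair per input, with a single space on each side of the mention, as in the usage example. The model card does not state these as requirements, and `check_marked_sentence` is a hypothetical helper.

```python
import re

# One mention, padded by single spaces, with no leading or trailing whitespace inside it.
MARKED_SPAN = re.compile(r"\[START\] (?P<mention>\S(?:.*?\S)?) \[END\]")


def check_marked_sentence(sentence: str) -> str:
    """Return the marked mention, or raise ValueError if the markers look malformed."""
    if sentence.count("[START]") != 1 or sentence.count("[END]") != 1:
        raise ValueError("expected exactly one [START] and one [END] marker")
    match = MARKED_SPAN.search(sentence)
    if match is None:
        raise ValueError("markers are out of order, or the mention is empty or padded")
    return match.group("mention")


print(check_marked_sentence("Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris."))
# Dreyfvs
```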

	## Environmental Impact

	- Hardware: 1x A100 (80GB) for finetuning
	- Training time: ~12 hours
	- Estimated CO₂ Emissions: ~2.3 kg CO₂eq

	## Contact

	- Website: [https://impresso-project.ch](https://impresso-project.ch)

	<p align="center">
	<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
	</p>