Commit 1b11449
Parent(s): c8f5136

review readme

README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-
 language:
 - multilingual
 - af
@@ -108,7 +108,7 @@ language:
 - yo
 - zh

-
 tags:
 - retrieval
 - entity-retrieval
@@ -119,14 +119,24 @@ tags:
 - text2text-generation
 ---


-


-
-GENRE performs retrieval generating the unique entity name conditioned on the input text using constrained beam search to only generate valid identifiers.

-

 | Dataset alias | README | Document type | Languages | Suitable for | Project | License |
 |---------|---------|---------------|-----------| ---------------|---------------| ---------------|
@@ -137,43 +147,62 @@ This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval
 | sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)|


-##
-

-## Usage
-
-Here is an example of generation for Wikipedia page disambiguation with simulated OCR noise:
 ```python
-from transformers import AutoTokenizer,
-from transformers import pipeline

 NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

-
-
-nel_tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
-
-sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
-             "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
-             "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]
-
-nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
-                        tokenizer=nel_tokenizer,
                         trust_remote_code=True,
                         device='cpu')
-for sentence in sentences:
-    print(sentence)
-    linked_entity = nel_pipeline(sentence)
-    print(linked_entity)
-```

 ```
-
-
-
 ```

-
-
-

@@ -1,5 +1,5 @@
 ---
+library_name: transformers
 language:
 - multilingual
 - af
@@ -108,7 +108,7 @@ language:
 - yo
 - zh

+license: agpl-3.0
 tags:
 - retrieval
 - entity-retrieval
@@ -119,14 +119,24 @@ tags:
 - text2text-generation
 ---

+# Model Card for `impresso-project/nel-mgenre-multilingual`

+The **Impresso multilingual named entity linking (NEL)** model is based on **mGENRE** (multilingual Generative ENtity REtrieval) proposed by [De Cao et al](https://arxiv.org/abs/2103.12528), a sequence-to-sequence architecture for entity disambiguation based on [mBART](https://arxiv.org/abs/2001.08210). It uses **constrained generation** to output entity names mapped to Wikidata/QIDs.

+This model was adapted for historical texts and fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which includes a variety of historical document types and languages.

+## Model Details

+- **Architecture:** mBART-based seq2seq with constrained beam search
+- **Languages supported:** multilingual (over 100 languages, optimized for fr, de, en)
+- **Training dataset:** HIPE-2022 (see below)
+- **Entity target space:** Wikidata entities
+- **Developed by:** DHLAB, EPFL
+- **License:** AGPL-3.0
+
+## Training Dataset
+
+The model was trained on the following datasets:

 | Dataset alias | README | Document type | Languages | Suitable for | Project | License |
 |---------|---------|---------------|-----------| ---------------|---------------| ---------------|
@@ -137,43 +147,62 @@
 | sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)|


+## How to Use

 ```python
+from transformers import AutoTokenizer, pipeline

 NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
+nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

+nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
+                        tokenizer=nel_tokenizer,
                         trust_remote_code=True,
                         device='cpu')

+sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
+print(nel_pipeline(sentence))
 ```
+
+### Output Format
+
+```python
+[
+    {
+        'surface': 'Dreyfvs',
+        'wkd_id': 'Q171826',
+        'wkpedia_pagename': 'Alfred Dreyfus',
+        'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
+        'type': 'UNK',
+        'confidence_nel': 99.98,
+        'lOffset': 24,
+        'rOffset': 33}]
 ```
+The type of the entity is `UNK` because the model was not trained on the entity type. The `confidence_nel` score indicates the model's confidence in the prediction.

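Note on the Output Format added above: a minimal sketch of how a caller might consume these records, assuming the pipeline returns a list of dicts with exactly the keys shown in the example. The helper names `filter_links` and `to_wikidata_url` and the threshold of 90.0 are hypothetical and not part of the model repository; treating `confidence_nel` as a 0-100 score is also an assumption based on the single example value.

```python
# Sample record copied from the Output Format example in the README.
SAMPLE_OUTPUT = [
    {
        "surface": "Dreyfvs",
        "wkd_id": "Q171826",
        "wkpedia_pagename": "Alfred Dreyfus",
        "wkpedia_url": "https://fr.wikipedia.org/wiki/Alfred_Dreyfus",
        "type": "UNK",
        "confidence_nel": 99.98,
        "lOffset": 24,
        "rOffset": 33,
    }
]


def filter_links(predictions, min_confidence=90.0):
    """Keep predictions whose confidence_nel is at least min_confidence.

    Hypothetical helper: assumes confidence_nel is on a 0-100 scale.
    """
    return [p for p in predictions if p.get("confidence_nel", 0.0) >= min_confidence]


def to_wikidata_url(prediction):
    """Build the Wikidata entity page URL from the predicted QID."""
    return f"https://www.wikidata.org/wiki/{prediction['wkd_id']}"


if __name__ == "__main__":
    for link in filter_links(SAMPLE_OUTPUT):
        print(link["surface"], "->", to_wikidata_url(link))
```

Filtering on `confidence_nel` before using the QID is one way to limit the impact of wrong links on noisy OCR input; the right cut-off would have to be tuned on held-out data.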
+## Use Cases
+
+- Entity disambiguation in noisy OCR settings
+- Linking historical names to modern Wikidata entities
+- Assisting downstream event extraction and biography generation from historical archives
+
+## Limitations
+
+- Sensitive to tokenisation and malformed spans
+- Accuracy degrades on non-Wikidata entities or in highly ambiguous contexts
+- Focused on historical entity mentions — performance may vary on modern texts
+
+## Environmental Impact
+
+- **Hardware:** 1x A100 (80GB) for finetuning
+- **Training time:** ~12 hours
+- **Estimated CO₂ Emissions:** ~2.3 kg CO₂eq
+
+## Contact
+
+- Website: [https://impresso-project.ch](https://impresso-project.ch)
+
+<p align="center">
+  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+</p>
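
Note on the Limitations added in this commit: since the model is described as sensitive to malformed spans, callers may want to build the `[START] ... [END]` input programmatically rather than by hand. The sketch below is one possible way to do that, assuming the marker format is exactly the one used in the README examples (marker, space, mention, space, marker). The function name `mark_mention` and its validation rules are hypothetical and not part of the model repository.

```python
START, END = "[START]", "[END]"


def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in [START] / [END] markers.

    Hypothetical helper: rejects inputs that would produce a malformed span.
    """
    if START in text or END in text:
        raise ValueError("text already contains a mention marker")
    if not 0 <= start < end <= len(text):
        raise ValueError("offsets must satisfy 0 <= start < end <= len(text)")
    mention = text[start:end]
    if mention != mention.strip():
        raise ValueError("mention must not start or end with whitespace")
    return f"{text[:start]}{START} {mention} {END}{text[end:]}"


if __name__ == "__main__":
    print(mark_mention("Dreyfvs est arrêté à Paris.", 0, 7))
```

Only one mention is marked per call, which matches the single-mention sentences shown in the README; whether the pipeline accepts several marked mentions in one input is not stated in this commit.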
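
Note on the Model Details added in this commit: the card says the model uses constrained beam search so that only valid entity names are generated. The sketch below illustrates the general idea of prefix-constrained decoding with a trie over token-id sequences. It is not the code shipped with this model: the `TokenTrie` class and the toy token ids are hypothetical. The callback follows the `prefix_allowed_tokens_fn(batch_id, input_ids)` signature accepted by `generate` in `transformers`, and assumes the stored sequences include whatever start tokens the decoder emits.

```python
class TokenTrie:
    """Prefix tree over token-id sequences of valid entity names (illustrative)."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def next_tokens(self, prefix):
        """Return the token ids that may follow `prefix`, or [] if it is invalid."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())


def make_prefix_fn(trie):
    """Build a callback with the (batch_id, input_ids) signature used by generate."""

    def prefix_allowed_tokens_fn(batch_id, input_ids):
        # int() also works on the 0-d tensors yielded by iterating a 1-D tensor.
        return trie.next_tokens([int(t) for t in input_ids])

    return prefix_allowed_tokens_fn


if __name__ == "__main__":
    # Toy vocabulary: 2 = start token, 1 = end token, other ids are name pieces.
    trie = TokenTrie([[2, 10, 11, 1], [2, 10, 12, 1]])
    allowed = make_prefix_fn(trie)
    print(sorted(allowed(0, [2, 10])))
```

At each decoding step the beam can only be extended with tokens returned by the callback, so every finished hypothesis spells out a sequence stored in the trie.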