emanuelaboros committed
Commit 1b11449 · 1 Parent(s): c8f5136

review readme

Files changed (1):
1. README.md +63 -34
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
-
  language:
  - multilingual
  - af
@@ -108,7 +108,7 @@ language:
  - yo
  - zh

-
  tags:
  - retrieval
  - entity-retrieval
@@ -119,14 +119,24 @@ tags:
  - text2text-generation
  ---

- # mGENRE

- The historical multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval) system as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on finetuned [mBART](https://arxiv.org/abs/2001.08210) architecture.
- GENRE performs retrieval generating the unique entity name conditioned on the input text using constrained beam search to only generate valid identifiers.

- This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), composed of the following datasets.

  | Dataset alias | README | Document type | Languages | Suitable for | Project | License |
  |---------|---------|---------------|-----------| ---------------|---------------| ---------------|
@@ -137,43 +147,62 @@ This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval
  | sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|

- ## BibTeX entry and citation info
-

- ## Usage
-
- Here is an example of generation for Wikipedia page disambiguation with simulated OCR noise:
  ```python
- from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
- from transformers import pipeline

  NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

- # Load the tokenizer and model from the specified pre-trained model name
- # The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
- nel_tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")
-
- sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
-              "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
-              "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]
-
- nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
-                         tokenizer=nel_tokenizer,
                          trust_remote_code=True,
                          device='cpu')
- for sentence in sentences:
-     print(sentence)
-     linked_entity = nel_pipeline(sentence)
-     print(linked_entity)
- ```

  ```
- [{'surface': 'Un1ted Press', 'wkd_id': 'Q493845', 'wkpedia_pagename': 'United Press International', 'wkpedia_url': 'https://en.wikipedia.org/wiki/United_Press_International', 'type': 'UNK', 'confidence_nel': 55.89, 'lOffset': 7, 'rOffset': 21}]
- [{'surface': 'Lon6on', 'wkd_id': 'Q84', 'wkpedia_pagename': 'London', 'wkpedia_url': 'https://de.wikipedia.org/wiki/London', 'type': 'UNK', 'confidence_nel': 99.99, 'lOffset': 10, 'rOffset': 18}]
- [{'surface': 'AFP', 'wkd_id': 'Q40464', 'wkpedia_pagename': 'Agence France-Presse', 'wkpedia_url': 'https://fr.wikipedia.org/wiki/Agence_France-Presse', 'type': 'UNK', 'confidence_nel': 100.0, 'lOffset': 45, 'rOffset': 50}]
  ```

- ---
- license: agpl-3.0
- ---
  ---
+ library_name: transformers
  language:
  - multilingual
  - af
 
  - yo
  - zh

+ license: agpl-3.0
  tags:
  - retrieval
  - entity-retrieval

  - text2text-generation
  ---

+ # Model Card for `impresso-project/nel-mgenre-multilingual`

+ The **Impresso multilingual named entity linking (NEL)** model is based on **mGENRE** (multilingual Generative ENtity REtrieval) proposed by [De Cao et al.](https://arxiv.org/abs/2103.12528), a sequence-to-sequence architecture for entity disambiguation built on [mBART](https://arxiv.org/abs/2001.08210). It uses **constrained generation** to output entity names that are mapped to Wikidata QIDs.

+ This model was adapted for historical texts and fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which includes a variety of historical document types and languages.

+ ## Model Details

+ - **Architecture:** mBART-based seq2seq with constrained beam search
+ - **Languages supported:** multilingual (over 100 languages, optimized for fr, de, en)
+ - **Training dataset:** HIPE-2022 (see below)
+ - **Entity target space:** Wikidata entities
+ - **Developed by:** DHLAB, EPFL
+ - **License:** AGPL-3.0
+
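The constrained beam search mentioned above restricts each decoding step to tokens that can still complete a valid entity name, typically via a token-prefix trie. The following is a minimal, model-free sketch of that lookup; the function names are ours, for illustration only, and the token IDs are toy values rather than real tokenizer output:

```python
def build_trie(sequences):
    """Build a nested-dict trie from token-ID sequences of valid entity names."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Return the token IDs that may follow `prefix` in some valid entity name."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # prefix cannot be extended to any valid name
    return list(node.keys())

# Toy "tokenised" entity names, e.g. London -> [5, 9], Londonderry -> [5, 9, 2]
trie = build_trie([[5, 9], [5, 9, 2], [7, 1]])
print(allowed_next_tokens(trie, []))      # [5, 7]
print(allowed_next_tokens(trie, [5, 9]))  # [2]
```

In `transformers`, a lookup like this is what one would wrap in the `prefix_allowed_tokens_fn` argument of `generate()` to forbid beams that leave the set of valid identifiers.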
+ ## Training Dataset
+
+ The model was trained on the following datasets:

  | Dataset alias | README | Document type | Languages | Suitable for | Project | License |
  |---------|---------|---------------|-----------| ---------------|---------------| ---------------|

  | sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|

+ ## How to Use

  ```python
+ from transformers import AutoTokenizer, pipeline

  NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"
+ nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

+ nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
+                         tokenizer=nel_tokenizer,
                          trust_remote_code=True,
                          device='cpu')

+ sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
+ print(nel_pipeline(sentence))
  ```
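As in the example above, the pipeline expects the target mention to be wrapped in `[START]`/`[END]` markers. A small helper for wrapping a character span can be sketched as follows (the function name is ours, not part of the pipeline's API):

```python
def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap the character span [start, end) in the [START]/[END] markers
    used by the NEL pipeline to locate the mention to link."""
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"

s = mark_mention("Dreyfvs est arrete a Paris.", 0, 7)
print(s)  # [START] Dreyfvs [END] est arrete a Paris.
```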
+
+ ### Output Format
+
+ ```python
+ [
+     {
+         'surface': 'Dreyfvs',
+         'wkd_id': 'Q171826',
+         'wkpedia_pagename': 'Alfred Dreyfus',
+         'wkpedia_url': 'https://fr.wikipedia.org/wiki/Alfred_Dreyfus',
+         'type': 'UNK',
+         'confidence_nel': 99.98,
+         'lOffset': 24,
+         'rOffset': 33
+     }
+ ]
+ ```
+ The type of the entity is `UNK` because the model was not trained to predict entity types. The `confidence_nel` score indicates the model's confidence in the prediction.
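Since each prediction carries a `confidence_nel` score, downstream code can discard low-confidence links. A minimal sketch, assuming the output format shown above (the helper name and threshold are our choices):

```python
def filter_links(predictions, min_confidence=90.0):
    """Keep only entity links whose confidence_nel meets the threshold."""
    return [p for p in predictions
            if p.get('confidence_nel', 0.0) >= min_confidence]

preds = [
    {'surface': 'Dreyfvs', 'wkd_id': 'Q171826', 'confidence_nel': 99.98},
    {'surface': 'Paris', 'wkd_id': 'Q90', 'confidence_nel': 42.5},
]
print(filter_links(preds))  # keeps only the high-confidence Dreyfus link
```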

+ ## Use Cases
+
+ - Entity disambiguation in noisy OCR settings
+ - Linking historical names to modern Wikidata entities
+ - Assisting downstream event extraction and biography generation from historical archives
+
+ ## Limitations
+
+ - Sensitive to tokenisation and malformed spans
+ - Accuracy degrades on non-Wikidata entities or in highly ambiguous contexts
+ - Focused on historical entity mentions — performance may vary on modern texts
+
+ ## Environmental Impact
+
+ - **Hardware:** 1x A100 (80GB) for fine-tuning
+ - **Training time:** ~12 hours
+ - **Estimated CO₂ Emissions:** ~2.3 kg CO₂eq
+
+ ## Contact
+
+ - Website: [https://impresso-project.ch](https://impresso-project.ch)
+
+ <p align="center">
+   <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+ </p>