---
license: apache-2.0
tags:
- flair
- token-classification
- sequence-tagger-model
language: es
datasets:
- conll2003
- BSC-LT/NextProcurement-NER-Spanish-UTE-Company-annotated
widget:
- text: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:"
- text: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:"
---

# Recognition of UTEs and company mentions in Flair

This is a model trained using [Flair](https://github.com/flairNLP/flair/) to recognise mentions of UTEs (Unión Temporal de Empresas, i.e. temporary joint ventures of companies) and of individual companies in public tenders.

It is a fine-tune of the `flair/ner-spanish-large` model (retrained from scratch to include additional tags).

The model is based on document-level XLM-R embeddings and [FLERT](https://arxiv.org/pdf/2011.06993v1.pdf/).

## Demo: How to use in Flair

Requires: **[Flair](https://github.com/flairNLP/flair/)** (`pip install flair`)

```python
from flair.data import Sentence
from flair.models import SequenceTagger
# load tagger
tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")
# make example sentence
sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```

This yields the following output:
```
Sentence[24]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L."/UTE, "PODACESA-ECR"/UTE]
The following NER tags are found:
Span[0:14]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L." → UTE (0.995)
Span[18:19]: "PODACESA-ECR" → UTE (0.9955)
```

With the sentence "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:", the output is:
```
Sentence[11]: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A"/SINGLE_COMPANY]
The following NER tags are found:
Span[0:6]: "PODACESA OBRAS Y SERVICIOS, S.A" → SINGLE_COMPANY (1.0)
```
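
For downstream processing it is often handier to collect the predicted mentions by label than to print them. The sketch below shows one possible way to do that, assuming each span exposes `text` and a `get_label("ner")` method returning an object with `value` and `score`, as Flair spans do. The helper name `group_mentions`, the `min_score` threshold, and the `FakeSpan`/`FakeLabel` stand-ins are illustrative assumptions; the stand-ins exist only so the example runs without downloading the model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeLabel:
    """Stand-in for a Flair label (hypothetical, for illustration only)."""
    value: str
    score: float


@dataclass
class FakeSpan:
    """Stand-in for a Flair span (hypothetical, for illustration only)."""
    text: str
    label: FakeLabel

    def get_label(self, label_type):
        # Mirrors the span.get_label("ner") call available on Flair spans.
        return self.label


def group_mentions(spans, label_type="ner", min_score=0.0):
    """Group span texts by predicted label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for span in spans:
        label = span.get_label(label_type)
        if label.score >= min_score:
            grouped[label.value].append(span.text)
    return dict(grouped)


# Stand-in spans mimicking two of the predictions shown above.
spans = [
    FakeSpan("PODACESA-ECR", FakeLabel("UTE", 0.9955)),
    FakeSpan("PODACESA OBRAS Y SERVICIOS, S.A", FakeLabel("SINGLE_COMPANY", 1.0)),
]
mentions = group_mentions(spans, min_score=0.9)
print(mentions)
```

With a real tagger, the same helper could be called as `group_mentions(sentence.get_spans('ner'))` after `tagger.predict(sentence)`.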

## Training: Script to train this model

The following Flair script was used to train this model (**TODO: update**):

```python
import torch
# 1. get the corpus
from flair.datasets import CONLL_03_SPANISH
corpus = CONLL_03_SPANISH()
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
#    (newer Flair releases provide corpus.make_label_dictionary(label_type=...) for this)
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# 4. initialize fine-tuneable transformer embeddings WITH document context
from flair.embeddings import TransformerWordEmbeddings
embeddings = TransformerWordEmbeddings(
    model='xlm-roberta-large',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)
# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
from flair.models import SequenceTagger
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)
# 6. initialize trainer with AdamW optimizer
from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
# 7. run training with XLM parameters (20 epochs, small LR)
from torch.optim.lr_scheduler import OneCycleLR
trainer.train('resources/taggers/ner-spanish-large',
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```
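
The script above loads the CoNLL Spanish corpus, which carries no `UTE` or `SINGLE_COMPANY` tags, so training on these tags needs a corpus annotated with them. Flair can read CoNLL-style column files (one token and its tag per line) through `ColumnCorpus`. As a rough sketch of what such a file might contain, the snippet below turns token-level span annotations into BIO tags. The function name `to_bio_lines`, the whitespace tokenisation, and the example annotation are assumptions made for illustration; the actual format of the training data is not described here.

```python
def to_bio_lines(tokens, spans):
    """Return 'token tag' lines in BIO format.

    spans: list of (start, end, label) over token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the mention
    return [f"{token} {tag}" for token, tag in zip(tokens, tags)]


# Hypothetical annotation of the second widget sentence (whitespace tokens).
tokens = "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:".split()
lines = to_bio_lines(tokens, [(0, 5, "SINGLE_COMPANY")])
print("\n".join(lines))
```

Each printed line pairs one token with its tag; sentences in such files are conventionally separated by an empty line.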

## Evaluation Results

```
Results:
- F-score (micro) 0.7431
- F-score (macro) 0.7429
- Accuracy 0.5944

By class:
               precision    recall  f1-score   support

           UTE    0.7568    0.7887    0.7724        71
SINGLE_COMPANY    0.6538    0.7846    0.7133        65

     micro avg    0.7039    0.7868    0.7431       136
     macro avg    0.7053    0.7867    0.7429       136
  weighted avg    0.7076    0.7868    0.7442       136
```
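
As a quick consistency check, the F1 scores in the report can be recomputed from the rounded precision and recall values. The sketch below is only an illustration of how the rows relate (F1 as the harmonic mean of precision and recall, macro as the unweighted mean over classes, weighted as the support-weighted mean); it is not the evaluation code used for the model. Because the inputs are already rounded to four decimals, the recomputed values may differ from the reported ones in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class (precision, recall, support) as reported in the table above.
per_class = {
    "UTE": (0.7568, 0.7887, 71),
    "SINGLE_COMPANY": (0.6538, 0.7846, 65),
}

f1_by_class = {name: f1(p, r) for name, (p, r, _) in per_class.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted F1: per-class F1 weighted by support.
total_support = sum(s for _, _, s in per_class.values())
weighted_f1 = (
    sum(f1_by_class[n] * s for n, (_, _, s) in per_class.items()) / total_support
)

# Micro F1: F1 of the pooled precision and recall from the "micro avg" row.
micro_f1 = f1(0.7039, 0.7868)

for name, score in f1_by_class.items():
    print(f"{name}: {score:.4f}")
print(f"macro: {macro_f1:.4f}  weighted: {weighted_f1:.4f}  micro: {micro_f1:.4f}")
```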

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the European Commission Health and Digital Executive Agency, Connecting Europe Facility, Grant Agreement Nº INEA/CEF/ICT/A2020/2373713, Action Title Open Harmonized and Enriched Procurement Data Platform (nextProcurement), Action number 2020-ES-IA-0255.

### Disclaimer
<details>
<summary>Click to expand</summary>

The model published in this repository is intended for general-purpose use and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it), or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.

</details>