Model Card for NER-base
Globalise NER token-classification model, development version.
Model Details
Model Description
This is the first version of a NER model developed for the Globalise project.
- Developed by: Sophie Arnoult
- Shared by: Globalise Team
- Funded by: NWO
- Model type: token classification
Uses
Named-entity tagging of historical (17th-18th century) Dutch documents related to the VOC (Dutch East India Company).
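A minimal sketch of how the model might be loaded for tagging with the Hugging Face `transformers` token-classification pipeline. The repository name `globalise/NER-base` is taken from this card. The helper names `load_tagger`, `entity_pairs`, and `demo` and the sample sentence are illustrative assumptions, not part of the Globalise code. Because the training data were pretokenized and split into short sequences, long documents may need to be chunked before tagging.

```python
from typing import Any, Dict, List, Tuple

MODEL_ID = "globalise/NER-base"  # repository name as listed on this card


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges subword pieces into entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def entity_pairs(predictions: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Reduce aggregated pipeline output to (text span, label) pairs."""
    return [(p["word"], p["entity_group"]) for p in predictions]


def demo() -> None:
    """Tag one made-up sentence; requires network access and transformers."""
    tagger = load_tagger()
    text = "Het schip Amsterdam is van Batavia vertrokken."  # invented example
    for span, label in entity_pairs(tagger(text)):
        print(f"{label}\t{span}")
```

Calling `demo()` prints one label and span per line. No expected output is shown here, since predictions on arbitrary text have not been verified.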
Bias, Risks, and Limitations
The texts the model was fine-tuned on are heavily biased and represent colonial standpoints. Care has been taken in designing the label set and annotating the data, but biases may remain when the model is applied to similar data. The model has not been tested on other data.
This is a development version. The training and development data consist of VOC missives data enriched with new annotations. Most entity types used in Globalise are absent from the VOC missives data, and the new annotations are limited in number, so reported performance on these entity types may not be representative.
Training Details
Training Data
The training and development data consist of:
- GM NER corpus (datasplit-all-standard, train/dev data), where labels are mapped to their Globalise equivalents
- Globalise annotated data (first set of annotations, to be extended and published at a later date)
The data are pretokenized with spaCy. Sequences are split at 240 word tokens.
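The splitting step can be sketched as follows. This is an illustration under stated assumptions, not the project's preprocessing code: the card does not say whether splits respect sentence boundaries, so the sketch cuts at fixed offsets, and a whitespace split stands in for spaCy tokenization. The function name `split_tokens` is hypothetical.

```python
from typing import Iterator, List, Sequence

MAX_WORDS = 240  # split length stated in this card


def split_tokens(
    tokens: Sequence[str], max_words: int = MAX_WORDS
) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `max_words` word tokens."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    for start in range(0, len(tokens), max_words):
        yield list(tokens[start:start + max_words])


# Whitespace split used here only as a stand-in for spaCy tokenization.
words = ("woord " * 500).split()
chunks = list(split_tokens(words))
print([len(chunk) for chunk in chunks])  # → [240, 240, 20]
```

Keeping chunks at 240 words leaves headroom for subword tokenization, where one word can expand into several pieces.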
Training Procedure
Training Hyperparameters
- Training regime: fp32
- Optimizer: Adam, learning rate 3e-5
- max-sequence-length: 512
- batch size: 32
- max-epochs: 20
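For readers who want to reproduce a comparable setup, the listed values could be expressed as keyword arguments for Hugging Face `TrainingArguments`. This is only a sketch: the card does not state which training framework was used, the mapping of "batch size" to `per_device_train_batch_size` assumes a single device, and the helper names `training_kwargs` and `make_training_arguments` are hypothetical.

```python
from typing import Any, Dict

# Applied when tokenizing (truncation length); not a TrainingArguments field.
MAX_SEQ_LENGTH = 512


def training_kwargs() -> Dict[str, Any]:
    """Hyperparameters from this card, keyed by TrainingArguments names."""
    return {
        "learning_rate": 3e-5,
        "per_device_train_batch_size": 32,  # assumes a single device
        "num_train_epochs": 20,  # upper bound; the card selects on validation F1
        "fp16": False,  # fp32 training regime
        "bf16": False,
        # The Trainer optimizes with AdamW by default; with weight_decay=0.0
        # it behaves like plain Adam, the optimizer named on this card.
        "weight_decay": 0.0,
    }


def make_training_arguments(output_dir: str):
    """Build TrainingArguments; requires transformers to be installed."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs())
```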
Evaluation
The model was selected based on the weighted multiclass F1 score on the validation data, using a single seed.
Results
| label | precision | recall | f1-score | support | 
|---|---|---|---|---|
| CMTY_NAME | 0.72 | 0.80 | 0.76 | 109 | 
| CMTY_QUAL | 1.00 | 0.67 | 0.80 | 9 | 
| CMTY_QUANT | 0.76 | 0.85 | 0.80 | 66 | 
| DATE | 0.48 | 0.53 | 0.51 | 43 | 
| DOC | 0.61 | 0.55 | 0.58 | 20 | 
| ETH_REL | 0.78 | 0.81 | 0.79 | 31 | 
| LOC_ADJ | 0.91 | 0.96 | 0.94 | 464 | 
| LOC_NAME | 0.91 | 0.94 | 0.92 | 1324 | 
| ORG | 0.92 | 0.87 | 0.89 | 265 | 
| PER_ATTR | 0.69 | 0.82 | 0.75 | 44 | 
| PER_NAME | 0.80 | 0.87 | 0.83 | 613 | 
| PRF | 0.70 | 0.76 | 0.73 | 97 | 
| SHIP | 0.89 | 0.86 | 0.87 | 519 | 
| SHIP_TYPE | 0.79 | 0.82 | 0.81 | 33 | 
| STATUS | 0.96 | 0.96 | 0.96 | 27 | 
| micro avg | 0.86 | 0.89 | 0.88 | 3664 | 
| macro avg | 0.79 | 0.80 | 0.80 | 3664 | 
| weighted avg | 0.86 | 0.89 | 0.88 | 3664 | 
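The macro and weighted averages in the table can be checked from the per-label rows. The sketch below assumes the usual definitions (macro: unweighted mean over labels; weighted: mean weighted by support); the card does not name the tool that produced the report. The F1 scores and supports are copied from the table, so the results agree with it only up to rounding.

```python
from typing import List, Tuple

# (label, f1-score, support) copied from the results table above
ROWS: List[Tuple[str, float, int]] = [
    ("CMTY_NAME", 0.76, 109), ("CMTY_QUAL", 0.80, 9), ("CMTY_QUANT", 0.80, 66),
    ("DATE", 0.51, 43), ("DOC", 0.58, 20), ("ETH_REL", 0.79, 31),
    ("LOC_ADJ", 0.94, 464), ("LOC_NAME", 0.92, 1324), ("ORG", 0.89, 265),
    ("PER_ATTR", 0.75, 44), ("PER_NAME", 0.83, 613), ("PRF", 0.73, 97),
    ("SHIP", 0.87, 519), ("SHIP_TYPE", 0.81, 33), ("STATUS", 0.96, 27),
]


def macro_average(rows: List[Tuple[str, float, int]]) -> float:
    """Unweighted mean of the per-label scores."""
    return sum(score for _, score, _ in rows) / len(rows)


def weighted_average(rows: List[Tuple[str, float, int]]) -> float:
    """Mean of the per-label scores weighted by support."""
    total = sum(support for _, _, support in rows)
    return sum(score * support for _, score, support in rows) / total


print(f"macro F1:    {macro_average(ROWS):.2f}")
print(f"weighted F1: {weighted_average(ROWS):.2f}")
```

The weighted average is dominated by frequent labels such as LOC_NAME and PER_NAME, while the macro average gives equal weight to rare labels such as CMTY_QUAL (support 9), which is why the two figures differ.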
Technical Specifications
Compute Infrastructure
SURF Snellius
Hardware
A100
Model tree for globalise/NER-base
Base model
FacebookAI/xlm-roberta-base