Model Card for NER-base

Globalise NER token-classification model, development version.

Model Details

Model Description

This is the first version of a NER model developed for the Globalise project.

Developed by: Sophie Arnoult
Shared by: Globalise Team
Funded by: NWO
Model type: token classification

Uses

Named-entity tagging of historical (17th- and 18th-century) Dutch documents related to the VOC (Dutch East India Company).
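A minimal sketch of how the model could be loaded for tagging, assuming it is served through the Hugging Face `transformers` token-classification pipeline under the repository name `globalise/NER-base`. The helper names `load_tagger` and `summarise` are illustrative, not part of the project, and the choice of `aggregation_strategy="simple"` is an assumption about how subword predictions are best merged for this label set.

```python
from typing import Dict, List

# Repository name as shown on the model page.
MODEL_ID = "globalise/NER-base"


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads weights on first call)."""
    # Imported lazily so that summarise() works without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        # Assumption: merge subword pieces into whole entity spans.
        aggregation_strategy="simple",
    )


def summarise(entities: List[Dict]) -> List[str]:
    """Turn aggregated pipeline output into 'LABEL: text' strings."""
    return [f"{e['entity_group']}: {e['word']}" for e in entities]
```

With network access, `summarise(load_tagger()(text))` would list the entities found in a string `text`. Input that resembles the training data (pretokenized VOC-era Dutch) is likely to give the most representative results.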

Bias, Risks, and Limitations

The texts the model was fine-tuned on are heavily biased and represent colonial standpoints. Care was taken in designing the label set and annotating the data, but biases may remain when the model is applied to similar data. The model has not been tested on other data.

This is a development version. The training and development data consist of VOC missives data enriched with new annotations. Most entity types used in Globalise are absent from the VOC missives data, and the new annotations that cover them are limited in number. Reported performance on these entity types may therefore not be representative.

Training Details

Training Data

The training and development data consist of:

GM NER corpus (datasplit-all-standard, train/dev data), where labels are mapped to their Globalise equivalents
Globalise annotated data (first set of annotations, to be extended and published at a later date)

The data are pretokenized with spaCy. Sequences are split at 240 word tokens.
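The splitting step can be sketched as follows. This assumes plain fixed-size chunking of an already tokenized document. The card does not say whether splits respect sentence or entity boundaries, so the function name `split_sequence` and the hard cut at every 240th word are illustrative assumptions.

```python
from typing import Iterator, List, Tuple

MAX_WORDS = 240  # word-token limit stated in the card


def split_sequence(
    tokens: List[str],
    labels: List[str],
    max_words: int = MAX_WORDS,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, labels) chunks of at most max_words word tokens."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    for start in range(0, len(tokens), max_words):
        end = start + max_words
        # Assumption: a hard cut, which may split an entity across chunks.
        yield tokens[start:end], labels[start:end]
```

A 500-word document would yield chunks of 240, 240, and 20 words. A word-level limit well below the 512 max-sequence-length leaves room for words that expand into several subword tokens, although the card does not state this as the reason.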

Training Procedure

Training Hyperparameters

Training regime: fp32
Optimizer: Adam, learning rate 3e-5
max-sequence-length: 512
batch size: 32
max-epochs: 20
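For reuse in a training script, the settings above can be collected in one place. The sketch below uses a plain dataclass because the card does not name the training framework. The class name `TrainConfig` and its field names are hypothetical, and only the values come from this card.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in this card (field names are illustrative)."""

    base_model: str = "FacebookAI/xlm-roberta-base"
    precision: str = "fp32"
    optimizer: str = "Adam"
    learning_rate: float = 3e-5
    max_sequence_length: int = 512
    batch_size: int = 32
    max_epochs: int = 20


config = TrainConfig()
# asdict() gives a plain dict that can be logged or passed to a trainer.
print(asdict(config))
```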

Evaluation

The model was selected on the weighted multiclass F1 score on the validation data, using a single seed.

Results

label	precision	recall	f1-score	support
CMTY_NAME	0.72	0.80	0.76	109
CMTY_QUAL	1.00	0.67	0.80	9
CMTY_QUANT	0.76	0.85	0.80	66
DATE	0.48	0.53	0.51	43
DOC	0.61	0.55	0.58	20
ETH_REL	0.78	0.81	0.79	31
LOC_ADJ	0.91	0.96	0.94	464
LOC_NAME	0.91	0.94	0.92	1324
ORG	0.92	0.87	0.89	265
PER_ATTR	0.69	0.82	0.75	44
PER_NAME	0.80	0.87	0.83	613
PRF	0.70	0.76	0.73	97
SHIP	0.89	0.86	0.87	519
SHIP_TYPE	0.79	0.82	0.81	33
STATUS	0.96	0.96	0.96	27
micro avg	0.86	0.89	0.88	3664
macro avg	0.79	0.80	0.80	3664
weighted avg	0.86	0.89	0.88	3664
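As a cross-check on the averaged rows, the macro and weighted F1 scores can be recomputed from the per-label F1 and support columns. The sketch below copies those two columns from the table. Because the inputs are already rounded to two decimals, the results only match the reported averages up to rounding.

```python
# (label, f1-score, support) copied from the results table above.
ROWS = [
    ("CMTY_NAME", 0.76, 109), ("CMTY_QUAL", 0.80, 9), ("CMTY_QUANT", 0.80, 66),
    ("DATE", 0.51, 43), ("DOC", 0.58, 20), ("ETH_REL", 0.79, 31),
    ("LOC_ADJ", 0.94, 464), ("LOC_NAME", 0.92, 1324), ("ORG", 0.89, 265),
    ("PER_ATTR", 0.75, 44), ("PER_NAME", 0.83, 613), ("PRF", 0.73, 97),
    ("SHIP", 0.87, 519), ("SHIP_TYPE", 0.81, 33), ("STATUS", 0.96, 27),
]

total_support = sum(support for _, _, support in ROWS)

# Macro average: every label counts equally, regardless of frequency.
macro_f1 = sum(f1 for _, f1, _ in ROWS) / len(ROWS)

# Weighted average: each label's F1 is weighted by its support.
weighted_f1 = sum(f1 * support for _, f1, support in ROWS) / total_support

print(f"support={total_support} macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

The gap between the two averages reflects the label distribution: frequent, well-recognised labels such as LOC_NAME dominate the weighted score, while rare labels such as DATE and DOC pull the macro score down.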

Technical Specifications

Compute Infrastructure

SURF Snellius

Hardware

A100

Weights format: Safetensors
Model size: 0.3B params
Tensor type: F32
Base model: FacebookAI/xlm-roberta-base