# SpanMarker
SpanMarker is a model and library for span classification, originally developed for Named Entity Recognition (NER). It can also be applied to Part-of-Speech (POS) tagging: assigning one label per word, where each label corresponds to a part of speech such as VERB, NOUN or ADJ. Depending on the dataset, the labels may be more fine-grained, e.g.
```
'"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'
```

We can use SpanMarker for this task by converting a POS Dataset into a 🤗 Dataset with `tokens` and `ner_tags` columns.

### Installing
If you are running this notebook in Google Colab, first make sure that "GPU" is selected as the Hardware Accelerator under `Runtime > Change runtime type`. Once set, we can install the dependencies.

In [9]:
!pip install span_marker -qqq

### Data loading and preprocessing

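Next, we can load any Hugging Face Hub dataset with POS data, e.g. [batterydata/pos_tagging](https://huggingface.co/datasets/batterydata/pos_tagging), and convert it into the format that SpanMarker requires. To keep this example simple, we only detect verbs: every word whose POS tag starts with `V` (`VB`, `VBD`, `VBG`, ...) is labeled `B-VERB`, and every other word is labeled `O`.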



In [8]:
from datasets import load_dataset

dataset_dict = load_dataset("batterydata/pos_tagging")

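def labels_to_ner_tags(sample):
    # Tags starting with "V" (VB, VBD, VBG, ...) map to 1 ("B-VERB"); everything else maps to 0 ("O")
    sample["ner_tags"] = [1 if label.startswith("V") else 0 for label in sample.pop("labels")]
    # SpanMarker expects the words in a `tokens` column
    sample["tokens"] = sample.pop("words")
    return sample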

dataset_dict = dataset_dict.map(labels_to_ner_tags)
train_dataset = dataset_dict["train"]
eval_dataset = dataset_dict["test"]
labels = ["O", "B-VERB", "I-VERB"]

train_dataset



  0%|          | 0/2 [00:00<?, ?it/s]

Map:   0%|          | 0/13054 [00:00<?, ? examples/s]

Map:   0%|          | 0/1451 [00:00<?, ? examples/s]

Dataset({
    features: ['ner_tags', 'tokens'],
    num_rows: 13054
})
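
To make the conversion concrete, here is a minimal sketch that applies the same rule to a single hand-written sample, without downloading anything. The example sentence and its tags are made up for illustration and are not taken from `batterydata/pos_tagging`.

```python
# Hypothetical sample in the same shape as the raw dataset rows (`words` + `labels`)
sample = {
    "words": ["The", "battery", "was", "charged", "quickly", "."],
    "labels": ["DT", "NN", "VBD", "VBN", "RB", "."],
}
labels = ["O", "B-VERB", "I-VERB"]

def labels_to_ner_tags(sample):
    # Same rule as above: tags starting with "V" become 1 ("B-VERB"), the rest 0 ("O")
    sample["ner_tags"] = [1 if label.startswith("V") else 0 for label in sample.pop("labels")]
    sample["tokens"] = sample.pop("words")
    return sample

# Work on a copy so the original toy sample stays untouched
converted = labels_to_ner_tags(dict(sample))
print(converted["tokens"])
print([labels[tag] for tag in converted["ner_tags"]])
```

Each integer in `ner_tags` is simply an index into the `labels` list, which is how the `labels` argument passed to the model later lines up with the dataset.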

### Model loading
We can initialize a SpanMarker model using a BERT-style encoder (e.g. the multilingual `xlm-roberta-large`). Because every POS label covers exactly one word, we set `entity_max_length` to 1.

In [10]:
from span_marker import SpanMarkerModel, Trainer

# Initialize a SpanMarker model using a pretrained BERT-style encoder
model_name = "xlm-roberta-large"
model = SpanMarkerModel.from_pretrained(
    model_name,
    labels=labels,
    # SpanMarker hyperparameters:
    model_max_length=128,
    marker_max_length=64,
    entity_max_length=1,
)

Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
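
The `entity_max_length=1` setting matters because SpanMarker classifies candidate spans of up to `entity_max_length` words. The sketch below is a simplified illustration of that idea, not SpanMarker's actual implementation; `candidate_spans` is a hypothetical helper and the sentence is made up.

```python
def candidate_spans(tokens, entity_max_length):
    """Enumerate (start, end) word spans with 1 <= end - start <= entity_max_length."""
    spans = []
    for start in range(len(tokens)):
        # `end` is exclusive and never runs past the end of the sentence
        for end in range(start + 1, min(start + entity_max_length, len(tokens)) + 1):
            spans.append((start, end))
    return spans

tokens = ["I", "love", "testing", "models"]
single_word = candidate_spans(tokens, entity_max_length=1)
multi_word = candidate_spans(tokens, entity_max_length=8)
print(len(single_word))  # one candidate per word
print(len(multi_word))   # every contiguous span of the sentence
```

With one label per word, longer spans can never be correct, so restricting the length to 1 keeps the number of candidates equal to the number of words.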


In [11]:
from transformers import TrainingArguments

# Prepare the 🤗 transformers training arguments
args = TrainingArguments(
    output_dir="models/span_marker_xlm_roberta_large_verbs",
    # Training Hyperparameters:
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,  # Use `bf16=True` instead if your hardware supports bf16.
    # Other Training parameters
    logging_first_step=True,
    logging_steps=50,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=1000,
    save_total_limit=2,
    dataloader_num_workers=2,
)

In [12]:
# Initialize the trainer using our model, training args & datasets
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)


INFO:span_marker.label_normalizer:Detected the IOB or IOB2 labeling scheme.


In [13]:
trainer.train()

Label normalizing the train dataset:   0%|          | 0/13054 [00:00<?, ? examples/s]

Tokenizing the train dataset:   0%|          | 0/13054 [00:00<?, ? examples/s]

A total of 10 (0.023546%) entities were missed due to the maximum input length.


Spreading data between multiple samples:   0%|          | 0/13054 [00:00<?, ? examples/s]

INFO:span_marker.trainer:Spread 13054 sentences across 13083 samples, a 0.222154% increase. You can increase `model_max_length` or `marker_max_length` to decrease the number of samples, but recognize that longer samples are slower.


Step,Training Loss,Validation Loss,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
1000,0.036,0.015133,0.991139,0.97326,0.982118,0.9956
2000,0.0126,0.013101,0.985581,0.986418,0.985999,0.996522
3000,0.0175,0.015389,0.973481,0.989389,0.98137,0.995337
4000,0.0115,0.017178,0.982052,0.987054,0.984547,0.996154


Label normalizing the evaluation dataset:   0%|          | 0/1451 [00:00<?, ? examples/s]

Tokenizing the evaluation dataset:   0%|          | 0/1451 [00:00<?, ? examples/s]

Spreading data between multiple samples:   0%|          | 0/1451 [00:00<?, ? examples/s]

INFO:span_marker.trainer:Spread 1451 sentences across 1455 samples, a 0.275672% increase. You can increase `model_max_length` or `marker_max_length` to decrease the number of samples, but recognize that longer samples are slower.


Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]



TrainOutput(global_step=4905, training_loss=0.03492981094323957, metrics={'train_runtime': 3646.2474, 'train_samples_per_second': 10.764, 'train_steps_per_second': 1.345, 'total_flos': 1.8283684849207296e+16, 'train_loss': 0.03492981094323957, 'epoch': 3.0})
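
The `global_step=4905` reported above can be reproduced from the training arguments and the sample count in the log. This is a back-of-the-envelope check that assumes a single GPU and that incomplete gradient accumulation cycles at the end of an epoch are not counted as a step; both are assumptions, chosen because they match the reported number.

```python
import math

num_samples = 13083              # training samples after spreading, from the log above
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 3

# Assumption: a single device, so the dataloader yields ceil(samples / batch size) batches
batches_per_epoch = math.ceil(num_samples / per_device_train_batch_size)
# Assumption: a trailing batch that does not fill an accumulation cycle adds no optimizer step
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size, batches_per_epoch, steps_per_epoch, total_steps)
```

With `eval_steps=1000`, this also explains why the metrics table has exactly four rows: evaluation ran at steps 1000, 2000, 3000 and 4000.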

In [14]:
trainer.save_model("models/span_marker_xlm_roberta_large_verbs/checkpoint-final")

In [17]:
# Compute & save the metrics on the test set
metrics = trainer.evaluate()
trainer.save_metrics("eval", metrics)
metrics

INFO:span_marker.trainer:Spread 1451 sentences across 1455 samples, a 0.275672% increase. You can increase `model_max_length` or `marker_max_length` to decrease the number of samples, but recognize that longer samples are slower.


{'eval_loss': 0.015249394811689854,
 'eval_overall_precision': 0.984514212982605,
 'eval_overall_recall': 0.9849320882852292,
 'eval_overall_f1': 0.9847231063017186,
 'eval_overall_accuracy': 0.99620633331577,
 'eval_runtime': 26.9862,
 'eval_samples_per_second': 53.916,
 'eval_steps_per_second': 13.488,
 'epoch': 3.0}
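
As a sanity check, the reported F1 is the harmonic mean of the reported precision and recall. The snippet below recomputes it from the two values printed above.

```python
# Values copied from the `metrics` output above
precision = 0.984514212982605
recall = 0.9849320882852292

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```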

In [16]:
trainer.model.predict("I'd love to try and figure out how well this works.")

[{'span': 'love',
  'label': 'VERB',
  'score': 0.9993988275527954,
  'char_start_index': 4,
  'char_end_index': 8},
 {'span': 'try',
  'label': 'VERB',
  'score': 0.9999654293060303,
  'char_start_index': 12,
  'char_end_index': 15},
 {'span': 'figure',
  'label': 'VERB',
  'score': 0.9992504715919495,
  'char_start_index': 20,
  'char_end_index': 26}]
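
The `char_start_index` and `char_end_index` fields are character offsets into the input string, with an exclusive end index. The sketch below copies the predictions shown above into a plain list and checks that slicing the sentence recovers each span; it does not need the model.

```python
sentence = "I'd love to try and figure out how well this works."

# Predictions copied from the output above
predictions = [
    {"span": "love", "label": "VERB", "score": 0.9993988275527954, "char_start_index": 4, "char_end_index": 8},
    {"span": "try", "label": "VERB", "score": 0.9999654293060303, "char_start_index": 12, "char_end_index": 15},
    {"span": "figure", "label": "VERB", "score": 0.9992504715919495, "char_start_index": 20, "char_end_index": 26},
]

# Slice the original sentence with the offsets of each prediction
recovered = [sentence[p["char_start_index"]:p["char_end_index"]] for p in predictions]
print(recovered)
```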

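### Comparison with spaCy
For reference, we can compare against spaCy's `en_core_web_trf` pipeline on the same test set, treating every token that spaCy tags as `VERB` or `AUX` as a verb.

In [None]: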
!pip install spacy spacy-transformers -qqq
!spacy download en_core_web_trf

In [48]:
import spacy_transformers
import spacy
from spacy.tokens import Doc
from tqdm.autonotebook import tqdm
import seqeval.metrics  # import the submodule explicitly so that `seqeval.metrics` is available

nlp = spacy.load("en_core_web_trf")

preds = []
golds = []
for tokens, gold_ner_tags in tqdm(zip(eval_dataset["tokens"], eval_dataset["ner_tags"]), total=len(eval_dataset)):
    doc = nlp(Doc(nlp.vocab, tokens))
    pred_ner_tags = ["B-VERB" if token.pos_ in ("VERB", "AUX") else "O" for token in doc]
    gold_ner_tags = ["B-VERB" if tag else "O" for tag in gold_ner_tags]
    preds.append(pred_ner_tags)
    golds.append(gold_ner_tags)

print(seqeval.metrics.classification_report(golds, preds))

  0%|          | 0/1451 [00:00<?, ?it/s]

              precision    recall  f1-score   support

        VERB       0.91      0.97      0.94      4712

   micro avg       0.91      0.97      0.94      4712
   macro avg       0.91      0.97      0.94      4712
weighted avg       0.91      0.97      0.94      4712
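
Because every verb in this setup is a single-word `B-VERB` span, span-level precision and recall reduce to simple per-token counts. The following is a minimal pure-Python sketch of that computation on made-up toy data; `verb_scores` is a hypothetical helper and not part of seqeval.

```python
def verb_scores(golds, preds):
    """Precision, recall and F1 for single-token "B-VERB" tags."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(golds, preds):
        for gold, pred in zip(gold_tags, pred_tags):
            if pred == "B-VERB" and gold == "B-VERB":
                tp += 1  # predicted verb that is a gold verb
            elif pred == "B-VERB":
                fp += 1  # predicted verb that is not a gold verb
            elif gold == "B-VERB":
                fn += 1  # gold verb that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold and predicted tag sequences, made up for illustration
toy_golds = [["O", "B-VERB", "O"], ["B-VERB", "O", "B-VERB"]]
toy_preds = [["O", "B-VERB", "B-VERB"], ["B-VERB", "O", "O"]]
precision, recall, f1 = verb_scores(toy_golds, toy_preds)
print(precision, recall, f1)
```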



Note that these results are not truly indicative of spaCy's power: it was trained on a slightly different label set, probably using a slightly different tokenizer.

However, it's clear that the SpanMarker model is very adept at classifying verbs, as it reached an F1 of 98.47 on the test set, compared to 0.94 for spaCy. You might be able to get slightly better performance if you use the English-only `roberta-large` instead of the multilingual `xlm-roberta-large`.
Beyond that, a `large` model might be overkill for this task.

In [59]:
trainer.create_model_card()
!cat models/span_marker_xlm_roberta_large_verbs/README.md

---
tags:
- generated_from_trainer
model-index:
- name: span_marker_xlm_roberta_large_verbs
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# span_marker_xlm_roberta_large_verbs

This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0152
- Overall Precision: 0.9845
- Overall Recall: 0.9849
- Overall F1: 0.9847
- Overall Accuracy: 0.9962

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2


In [None]:
model.push_to_hub("span-marker-xlm-roberta-large-verbs")

https://huggingface.co/tomaarsen/span-marker-xlm-roberta-large-verbs