Armenian NER Model (XLM-RoBERTa based)

This repository contains a Named Entity Recognition (NER) model for the Armenian language, fine-tuned from the xlm-roberta-base checkpoint.

The model identifies the following entity types based on the pioNER dataset tags: PER (Person), LOC (Location), ORG (Organization), EVT (Event), PRO (Product), FAC (Facility), ANG (Animal), DUC (Document), WRK (Work of Art), CMP (Chemical Compound/Drug), MSR (Measure/Quantity), DTM (Date/Time), MNY (Money), PCT (Percent), LAG (Language), LAW (Law), NOR (Nationality/Religious/Political Group).
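
The list above is a human-readable summary; the exact tag set stored in the checkpoint (including B-/I- prefixes, if present) can be read directly from the model config. A minimal sketch using the standard transformers API:

from transformers import AutoConfig

# Print the checkpoint's own label mapping; the exact ids and prefixes depend on the checkpoint
config = AutoConfig.from_pretrained("daviddallakyan2005/armenian-ner")
print(config.id2label)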

This specific checkpoint (daviddallakyan2005/armenian-ner) corresponds to training run run_16 from the associated project, selected based on the best F1 score on the pioNER validation set during a hyperparameter search involving 36 variations.

Associated GitHub Repository: https://github.com/daviddallakyan2005/armenian-ner-network.git (Contains training, inference, and network analysis scripts)

Model Details

  • Base Model: xlm-roberta-base (XLM-RoBERTa, released by Facebook AI Research; original implementation in the fairseq library)
  • Training Data: pioNER Corpus (pioner-silver for training/validation, pioner-gold for testing), loaded via the conll2003 dataset loading script.
  • Fine-tuning Framework: transformers with PyTorch
  • Hyperparameters (run_16):
    • Learning Rate: 2e-5
    • Weight Decay: 0.01
    • Batch Size (per device): 8
    • Gradient Accumulation Steps: 1
    • Epochs: 7
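
For reference, the run_16 hyperparameters above map onto transformers TrainingArguments roughly as follows. This is a sketch, not the project's exact training configuration; output_dir is illustrative and other Trainer settings are omitted:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="armenian-ner-run_16",   # illustrative path, not the project's actual output directory
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=7,
)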

Intended Uses & Limitations

This model is designed for general-purpose Named Entity Recognition in Armenian text. The inference scripts in the associated project (armenian-ner-network) use the ArmTokenizer library for pre-tokenization, whereas the transformers pipeline example below uses the built-in xlm-roberta-base tokenizer directly.

  • Primary Use: NER / Token Classification for Armenian.
  • Limitations: Performance may degrade on domains that differ significantly from the pioNER dataset. The example below uses the "simple" aggregation strategy; other strategies (e.g. "first" or "max") may produce better entity boundaries in some cases.

How to Use

You can easily use this model with the transformers library pipeline:

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model from Hugging Face Hub
model_name = "daviddallakyan2005/armenian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
# Use "simple" aggregation for basic entity grouping
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր, աստվածաբան և բանաստեղծ։ Նա ծնվել է Նարեկ գյուղում։"

# Get predictions
entities = ner_pipeline(text)
print(entities)

# Example Output:
# [
#  {'entity_group': 'PER', 'score': 0.99..., 'word': 'Գրիգոր Նարեկացին', 'start': 0, 'end': 16},
#  {'entity_group': 'LOC', 'score': 0.98..., 'word': 'Նարեկ', 'start': 87, 'end': 92}
# ]

(See scripts/03_ner/run_ner_inference_segmented.py in the GitHub repo for an example integrating ArmTokenizer before passing words to the Hugging Face tokenizer)
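
If you pre-tokenize the text yourself (as the repository's segmented inference script does with ArmTokenizer), you can pass the word list to the Hugging Face tokenizer with is_split_into_words=True and map predictions back to words via word_ids(). The sketch below continues from the pipeline example above and uses a plain whitespace split as a stand-in for ArmTokenizer, since the exact ArmTokenizer call is not shown here:

import torch

# Stand-in for ArmTokenizer output: a naive whitespace split (assumption for illustration only)
words = text.split()

# Tokenize pre-split words, keeping the word-to-subword alignment
encoding = tokenizer(words, is_split_into_words=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = encoding.word_ids(batch_index=0)

# Report the prediction of the first subword of each word
previous_word_id = None
for word_id, pred_id in zip(word_ids, pred_ids):
    if word_id is None or word_id == previous_word_id:
        continue
    previous_word_id = word_id
    print(words[word_id], model.config.id2label[pred_id])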

Training Procedure

The xlm-roberta-base model was fine-tuned on the Armenian pioNER dataset using the transformers Trainer API. The training involved:

  1. Loading the conll2003 format pioNER data.
  2. Tokenizing the text using the xlm-roberta-base tokenizer and aligning NER tags to subword tokens (labeling only the first subword of each word).
  3. Setting up TrainingArguments with varying hyperparameters (learning rate, weight decay, epochs, gradient accumulation).
  4. Instantiating AutoModelForTokenClassification with the correct number of labels and mappings (id2label, label2id) derived from the dataset.
  5. Using DataCollatorForTokenClassification for batching.
  6. Implementing a compute_metrics function using seqeval (precision, recall, F1) for evaluation during training.
  7. Running a hyperparameter search over 36 combinations, saving checkpoints and logs for each run.
  8. Selecting the best model based on the highest F1 score achieved on the validation set (pioner-silver/dev.conll03).
  9. Evaluating the best model on the test set (pioner-gold/test.conll03).

(See scripts/03_ner/ner_roberta.py in the GitHub repo for the full training script.)
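
Step 2 above (aligning word-level NER tags to subword tokens and labeling only the first subword of each word) typically looks like the sketch below, which follows the standard transformers token-classification recipe rather than reproducing the project's exact code:

def tokenize_and_align_labels(examples, tokenizer):
    # examples["tokens"]: lists of words; examples["ner_tags"]: word-level label ids
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)                 # special tokens: ignored by the loss
            elif word_id != previous_word_id:
                label_ids.append(word_labels[word_id]) # first subword keeps the word's tag
            else:
                label_ids.append(-100)                 # remaining subwords: ignored
            previous_word_id = word_id
        labels.append(label_ids)

    tokenized["labels"] = labels
    return tokenized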

Evaluation

This model (run_16) achieved the best F1 score on the pioNER validation set during the hyperparameter search. Final evaluation metrics on the pioNER gold test set are logged in the training artifacts within the associated GitHub project.
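
The seqeval-based metrics mentioned in the training procedure are typically computed as in the sketch below (assuming a label_list of string tags; the project's exact compute_metrics may differ and would be wrapped in a closure or functools.partial before being passed to the Trainer):

import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score

def compute_metrics(eval_pred, label_list):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Drop positions labeled -100 (special tokens and non-first subwords)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
    }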

Citation

If you use this model or the associated code, please consider citing the GitHub repository:

@misc{armenian_ner_network_2025,
  author = {David Dallakyan},
  title = {Armenian NER and Network Analysis Project},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/daviddallakyan2005/armenian-ner-network}}
}

Please also cite the original XLM-RoBERTa paper and the pioNER dataset creators if applicable.
