Armenian NER Model (XLM-RoBERTa based)
This repository contains a Named Entity Recognition (NER) model for the Armenian language, fine-tuned from the `xlm-roberta-base` checkpoint.
The model identifies the following entity types based on the pioNER dataset tags: PER (Person), LOC (Location), ORG (Organization), EVT (Event), PRO (Product), FAC (Facility), ANG (Animal), DUC (Document), WRK (Work of Art), CMP (Chemical Compound/Drug), MSR (Measure/Quantity), DTM (Date/Time), MNY (Money), PCT (Percent), LAG (Language), LAW (Law), NOR (Nationality/Religious/Political Group).
This specific checkpoint (`daviddallakyan2005/armenian-ner`) corresponds to training run `run_16` from the associated project, selected based on the best F1 score on the pioNER validation set during a hyperparameter search involving 36 variations.
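To confirm exactly which labels this checkpoint exposes at inference time, you can read the `id2label` mapping from its config. The printed mapping shown in the comment is illustrative, not a verbatim dump of the checkpoint's label set.

```python
from transformers import AutoConfig

# Inspect the BIO label set stored in the checkpoint's config.
config = AutoConfig.from_pretrained("daviddallakyan2005/armenian-ner")
print(config.id2label)  # e.g. {0: 'O', 1: 'B-PER', 2: 'I-PER', ...} (illustrative)
```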
Associated GitHub Repository: https://github.com/daviddallakyan2005/armenian-ner-network.git (Contains training, inference, and network analysis scripts)
Model Details
- Base Model: `xlm-roberta-base` (originating from research associated with Facebook AI Research's `fairseq` library)
- Training Data: pioNER Corpus (specifically, `pioner-silver` for training/validation and `pioner-gold` for testing, loaded via the `conll2003` dataset script)
- Fine-tuning Framework: `transformers`, `pytorch`
- Hyperparameters (`run_16`, see the sketch below):
  - Learning Rate: 2e-5
  - Weight Decay: 0.01
  - Batch Size (per device): 8
  - Gradient Accumulation Steps: 1
  - Epochs: 7
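As a rough reconstruction, the `run_16` settings above map onto standard `transformers` `TrainingArguments` as sketched below. The argument names are the library's usual ones; the output directory is an assumption, not a value taken from the original training script.

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the run_16 settings; output_dir is a
# placeholder, not the project's actual path.
training_args = TrainingArguments(
    output_dir="runs/run_16",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=7,
)
```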
Intended Uses & Limitations
This model is designed for general-purpose Named Entity Recognition in Armenian text. It leverages the `ArmTokenizer` library for pre-tokenization in the inference process shown in the associated project's scripts (`armenian-ner-network`), although the `transformers` pipeline example below uses the built-in `xlm-roberta-base` tokenizer directly.
- Primary Use: NER / Token Classification for Armenian.
- Limitations: Performance might degrade on domains significantly different from the pioNER dataset. The aggregation strategy in the example below is simple; more complex strategies might be needed for optimal entity boundary detection in all cases.
How to Use
You can easily use this model with the `transformers` library pipeline:
```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model from Hugging Face Hub
model_name = "daviddallakyan2005/armenian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
# Use "simple" aggregation for basic entity grouping
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր, աստվածաբան և բանաստեղծ։ Նա ծնվել է Նարեկ գյուղում։"

# Get predictions
entities = ner_pipeline(text)
print(entities)

# Example Output:
# [
#   {'entity_group': 'PER', 'score': 0.99..., 'word': 'Գրիգոր Նարեկացին', 'start': 0, 'end': 16},
#   {'entity_group': 'LOC', 'score': 0.98..., 'word': 'Նարեկ', 'start': 87, 'end': 92}
# ]
```
(See `scripts/03_ner/run_ner_inference_segmented.py` in the GitHub repo for an example integrating `ArmTokenizer` before passing words to the Hugging Face tokenizer.)
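If you want to reproduce that pre-tokenized flow without the pipeline, the sketch below passes pre-split words to the Hugging Face tokenizer with `is_split_into_words=True` and reads off the prediction of each word's first subword. A plain whitespace split stands in for the pre-tokenizer here; the repo script uses `ArmTokenizer` for that step instead.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "daviddallakyan2005/armenian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Stand-in pre-tokenizer: a whitespace split instead of ArmTokenizer.
words = "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր".split()

# Encode pre-split words and keep the word <-> subword alignment.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Take the prediction of the first subword of each word.
labels = []
prev_word = None
for idx, word_id in enumerate(enc.word_ids(batch_index=0)):
    if word_id is not None and word_id != prev_word:
        labels.append((words[word_id], model.config.id2label[pred_ids[idx]]))
    prev_word = word_id
print(labels)
```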
Training Procedure
The `xlm-roberta-base` model was fine-tuned on the Armenian pioNER dataset using the `transformers` `Trainer` API. The training involved:
- Loading the `conll2003`-format pioNER data.
- Tokenizing the text using the `xlm-roberta-base` tokenizer and aligning NER tags to subword tokens, labeling only the first subword of each word (see the first sketch below).
- Setting up `TrainingArguments` with varying hyperparameters (learning rate, weight decay, epochs, gradient accumulation).
- Instantiating `AutoModelForTokenClassification` with the correct number of labels and mappings (`id2label`, `label2id`) derived from the dataset.
- Using `DataCollatorForTokenClassification` for batching.
- Implementing a `compute_metrics` function using `seqeval` (precision, recall, F1) for evaluation during training (see the second sketch below).
- Running a hyperparameter search over 36 combinations, saving checkpoints and logs for each run.
- Selecting the best model based on the highest F1 score achieved on the validation set (`pioner-silver/dev.conll03`).
- Evaluating the best model on the test set (`pioner-gold/test.conll03`).
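The subword alignment step typically looks like the following. This is a generic reconstruction of the common `transformers` pattern, not a verbatim excerpt from the project's training script; the `tokens` and `ner_tags` column names follow the `conll2003`-style dataset loading described above.

```python
def tokenize_and_align_labels(examples, tokenizer):
    # Tokenize pre-split words and re-align NER tags to subword tokens.
    tokenized = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)                   # special tokens: ignored by the loss
            elif word_id != previous_word:
                label_ids.append(word_labels[word_id])   # label only the first subword
            else:
                label_ids.append(-100)                   # mask the remaining subwords
            previous_word = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```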
(See `scripts/03_ner/ner_roberta.py` in the GitHub repo for the full training script.)
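The `compute_metrics` step mentioned in the list above can be reconstructed along these lines using `seqeval`. The `label_list` shown is an illustrative subset, not the checkpoint's actual tag inventory; in practice it would be derived from the dataset's label mapping.

```python
import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score

label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # illustrative subset

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Drop positions labeled -100 (special tokens / non-first subwords).
    true_labels, true_preds = [], []
    for pred_row, label_row in zip(predictions, labels):
        true_labels.append([label_list[l] for p, l in zip(pred_row, label_row) if l != -100])
        true_preds.append([label_list[p] for p, l in zip(pred_row, label_row) if l != -100])
    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
    }
```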
Evaluation
This model (`run_16`) achieved the best F1 score on the pioNER validation set during the hyperparameter search. Final evaluation metrics on the pioNER gold test set are logged in the training artifacts within the associated GitHub project.
Citation
If you use this model or the associated code, please consider citing the GitHub repository:
```bibtex
@misc{armenian_ner_network_2025,
  author       = {David Dallakyan},
  title        = {Armenian NER and Network Analysis Project},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/daviddallakyan2005/armenian-ner-network}}
}
```
Please also cite the original XLM-RoBERTa paper and the pioNER dataset creators if applicable.