---
language: hy
license: mit
tags:
- armenian
- ner
- token-classification
- xlm-roberta
library_name: transformers
pipeline_tag: token-classification
datasets:
- pioner
metrics:
- f1
widget:
- text: "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր։"
---

# Armenian NER Model (XLM-RoBERTa based)

This repository contains a Named Entity Recognition (NER) model for the Armenian language, fine-tuned from the `xlm-roberta-base` checkpoint.

The model identifies the following entity types based on the pioNER dataset tags: `PER` (Person), `LOC` (Location), `ORG` (Organization), `EVT` (Event), `PRO` (Product), `FAC` (Facility), `ANG` (Animal), `DUC` (Document), `WRK` (Work of Art), `CMP` (Chemical Compound/Drug), `MSR` (Measure/Quantity), `DTM` (Date/Time), `MNY` (Money), `PCT` (Percent), `LAG` (Language), `LAW` (Law), `NOR` (Nationality/Religious/Political Group).

This specific checkpoint (`daviddallakyan2005/armenian-ner`) corresponds to training run `run_16` of the associated project, selected for achieving the best F1 score on the pioNER validation set during a hyperparameter search over 36 configurations.

**Associated GitHub Repository:** [https://github.com/daviddallakyan2005/armenian-ner-network.git](https://github.com/daviddallakyan2005/armenian-ner-network.git) (contains training, inference, and network analysis scripts)

## Model Details

* **Base Model:** `xlm-roberta-base` (originating from research associated with Facebook AI Research's [`fairseq`](https://github.com/facebookresearch/fairseq.git) library)
* **Training Data:** [pioNER Corpus](https://github.com/ispras-texterra/pioner.git) (specifically, `pioner-silver` for training/validation and `pioner-gold` for testing, loaded via the `conll2003` dataset script)
* **Fine-tuning Framework:** `transformers`, `pytorch`
* **Hyperparameters (run_16):**
  * Learning Rate: `2e-5`
  * Weight Decay: `0.01`
  * Batch Size (per device): `8`
  * Gradient Accumulation Steps: `1`
  * Epochs: `7`

## Intended Uses & Limitations

This model is designed for general-purpose Named Entity Recognition in Armenian text. The inference scripts in the associated project (`armenian-ner-network`) use the [`ArmTokenizer`](https://github.com/DavidDavidsonDK/ArmTokenizer.git) library for pre-tokenization, although the `transformers` pipeline example below uses the built-in `xlm-roberta-base` tokenizer directly.

* **Primary Use:** NER / token classification for Armenian text.
* **Limitations:** Performance may degrade on domains that differ significantly from the pioNER dataset. The aggregation strategy in the example below is deliberately simple; more sophisticated strategies may be needed for optimal entity boundary detection in all cases.
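The checkpoint's configuration stores the full tag inventory. Here is a minimal sketch of how to list it using the standard `transformers` config API; with the `conll2003`-style loader, the tags are expected to follow the usual BIO scheme (e.g. `B-PER`, `I-PER`, `O`):

```python
from transformers import AutoConfig

# Print the checkpoint's full label map; expected to contain BIO variants
# (B-PER, I-PER, B-LOC, ...) of the entity types listed above.
config = AutoConfig.from_pretrained("daviddallakyan2005/armenian-ner")
print(config.id2label)
```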
## How to Use

You can use this model directly with the `transformers` pipeline:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model from the Hugging Face Hub
model_name = "daviddallakyan2005/armenian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create the NER pipeline.
# "simple" aggregation groups subword predictions into basic entity spans.
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր, աստվածաբան և բանաստեղծ։ Նա ծնվել է Նարեկ գյուղում։"

# Get predictions
entities = ner_pipeline(text)
print(entities)

# Example output:
# [
#   {'entity_group': 'PER', 'score': 0.99..., 'word': 'Գրիգոր Նարեկացին', 'start': 0, 'end': 16},
#   {'entity_group': 'LOC', 'score': 0.98..., 'word': 'Նարեկ', 'start': 87, 'end': 92}
# ]
```

*(See `scripts/03_ner/run_ner_inference_segmented.py` in the [GitHub repo](https://github.com/daviddallakyan2005/armenian-ner-network.git) for an example that integrates [`ArmTokenizer`](https://github.com/DavidDavidsonDK/ArmTokenizer.git) before passing words to the Hugging Face tokenizer.)*

## Training Procedure

The `xlm-roberta-base` model was fine-tuned on the Armenian pioNER dataset using the `transformers` `Trainer` API. The training involved:

1. Loading the `conll2003`-format pioNER data.
2. Tokenizing the text with the `xlm-roberta-base` tokenizer and aligning NER tags to subword tokens, labeling only the first subword of each word (a sketch of this alignment appears in the appendix below).
3. Setting up `TrainingArguments` with varying hyperparameters (learning rate, weight decay, epochs, gradient accumulation).
4. Instantiating `AutoModelForTokenClassification` with the correct number of labels and the `id2label`/`label2id` mappings derived from the dataset.
5. Using `DataCollatorForTokenClassification` for batching.
6. Implementing a `compute_metrics` function using `seqeval` (precision, recall, F1) for evaluation during training.
7. Running a hyperparameter search over 36 combinations, saving checkpoints and logs for each run.
8. Selecting the best model by highest F1 score on the validation set (`pioner-silver/dev.conll03`).
9. Evaluating the best model on the test set (`pioner-gold/test.conll03`).

(See `scripts/03_ner/ner_roberta.py` in the [GitHub repo](https://github.com/daviddallakyan2005/armenian-ner-network.git) for the full training script.)

## Evaluation

This model (`run_16`) achieved the best F1 score on the pioNER validation set during the hyperparameter search. Final evaluation metrics on the pioNER gold test set are logged in the training artifacts within the associated GitHub project.

## Citation

If you use this model or the associated code, please consider citing the GitHub repository:

```bibtex
@misc{armenian_ner_network_2025,
  author       = {David Dallakyan},
  title        = {Armenian NER and Network Analysis Project},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/daviddallakyan2005/armenian-ner-network}}
}
```

Please also cite the original XLM-RoBERTa paper and the pioNER dataset creators where applicable.
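## Appendix: Subword Label Alignment (Sketch)

Step 2 of the training procedure assigns a label only to the first subword of each word. The snippet below is a minimal, hypothetical sketch of that alignment using the standard fast-tokenizer `word_ids()` API; it is not the project's exact code, which lives in `scripts/03_ner/ner_roberta.py` in the GitHub repo.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def align_labels(words, word_label_ids):
    """Tokenize pre-split words; keep each word's label only on its first subword."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels = []
    previous_word_id = None
    for word_id in encoding.word_ids():
        if word_id is None:                    # special tokens (<s>, </s>)
            labels.append(-100)                # -100 is ignored by the loss
        elif word_id != previous_word_id:      # first subword of a new word
            labels.append(word_label_ids[word_id])
        else:                                  # remaining subwords of the word
            labels.append(-100)
        previous_word_id = word_id
    encoding["labels"] = labels
    return encoding

# Hypothetical integer tag ids, e.g. 0 = "O", 1 = "B-PER", 2 = "I-PER":
batch = align_labels(["Գրիգոր", "Նարեկացին", "հոգևորական", "էր"], [1, 2, 0, 0])
print(batch["labels"])
```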