---
language: hy
license: mit
tags:
- armenian
- ner
- token-classification
- xlm-roberta
library_name: transformers
pipeline_tag: token-classification
datasets:
- pioner
metrics:
- f1
widget:
- text: "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր։"
---

# Armenian NER Model (XLM-RoBERTa based)

This repository contains a Named Entity Recognition (NER) model for the Armenian language, fine-tuned from the `xlm-roberta-base` checkpoint.

The model identifies the following entity types, based on the pioNER dataset tags: `PER` (Person), `LOC` (Location), `ORG` (Organization), `EVT` (Event), `PRO` (Product), `FAC` (Facility), `ANG` (Animal), `DUC` (Document), `WRK` (Work of Art), `CMP` (Chemical Compound/Drug), `MSR` (Measure/Quantity), `DTM` (Date/Time), `MNY` (Money), `PCT` (Percent), `LAG` (Language), `LAW` (Law), and `NOR` (Nationality/Religious/Political Group).
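
The checkpoint's `config.json` holds the authoritative `id2label` mapping; purely as an illustration, under the common BIO tagging scheme the 17 entity types above would expand to 35 labels (one `O` tag plus a `B-`/`I-` pair per type):

```python
# Sketch: expanding the 17 pioNER entity types into a BIO label set.
# Assumes the standard BIO scheme; consult the checkpoint's config.json
# (id2label) for the authoritative mapping.
ENTITY_TYPES = [
    "PER", "LOC", "ORG", "EVT", "PRO", "FAC", "ANG", "DUC", "WRK",
    "CMP", "MSR", "DTM", "MNY", "PCT", "LAG", "LAW", "NOR",
]

labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

print(len(labels))  # 35: O plus B-/I- for each of the 17 types
```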

This specific checkpoint (`daviddallakyan2005/armenian-ner`) corresponds to training run `run_16` of the associated project, selected for the best F1 score on the pioNER validation set during a hyperparameter search over 36 configurations.

**Associated GitHub Repository:** [https://github.com/daviddallakyan2005/armenian-ner-network.git](https://github.com/daviddallakyan2005/armenian-ner-network.git) (contains training, inference, and network analysis scripts)

## Model Details

* **Base Model:** `xlm-roberta-base`, released by Facebook AI Research (trained with the [`fairseq`](https://github.com/facebookresearch/fairseq.git) library)
* **Training Data:** [pioNER Corpus](https://github.com/ispras-texterra/pioner.git) (specifically, `pioner-silver` for training/validation and `pioner-gold` for testing, loaded via the `conll2003` dataset script)
* **Fine-tuning Framework:** `transformers`, `pytorch`
* **Hyperparameters (run_16):**
  * Learning Rate: `2e-5`
  * Weight Decay: `0.01`
  * Batch Size (per device): `8`
  * Gradient Accumulation Steps: `1`
  * Epochs: `7`
42
+ ## Intended Uses & Limitations
43
+
44
+ This model is designed for general-purpose Named Entity Recognition in Armenian text. It leverages the [`ArmTokenizer`](https://github.com/DavidDavidsonDK/ArmTokenizer.git) library for pre-tokenization in the inference process shown in the associated project's scripts (`armenian-ner-network`), although the `transformers` pipeline example below uses the built-in `xlm-roberta-base` tokenizer directly.
45
+
46
+ * **Primary Use:** NER / Token Classification for Armenian.
47
+ * **Limitations:** Performance might degrade on domains significantly different from the pioNER dataset. The aggregation strategy in the example below is simple; more complex strategies might be needed for optimal entity boundary detection in all cases.

## How to Use

You can use this model with the `transformers` pipeline:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model from the Hugging Face Hub
model_name = "daviddallakyan2005/armenian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline with "simple" aggregation for basic entity grouping
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր, աստվածաբան և բանաստեղծ։ Նա ծնվել է Նարեկ գյուղում։"

# Get predictions
entities = ner_pipeline(text)
print(entities)

# Example output:
# [
#   {'entity_group': 'PER', 'score': 0.99..., 'word': 'Գրիգոր Նարեկացին', 'start': 0, 'end': 16},
#   {'entity_group': 'LOC', 'score': 0.98..., 'word': 'Նարեկ', 'start': 87, 'end': 92}
# ]
```

*(See `scripts/03_ner/run_ner_inference_segmented.py` in the [GitHub repo](https://github.com/daviddallakyan2005/armenian-ner-network.git) for an example that integrates [`ArmTokenizer`](https://github.com/DavidDavidsonDK/ArmTokenizer.git) before passing words to the Hugging Face tokenizer.)*

## Training Procedure

The `xlm-roberta-base` model was fine-tuned on the Armenian pioNER dataset using the `transformers` `Trainer` API. The training involved:

1. Loading the `conll2003`-format pioNER data.
2. Tokenizing the text with the `xlm-roberta-base` tokenizer and aligning NER tags to subword tokens (labeling only the first subword of each word).
3. Setting up `TrainingArguments` with varying hyperparameters (learning rate, weight decay, epochs, gradient accumulation).
4. Instantiating `AutoModelForTokenClassification` with the correct number of labels and the `id2label`/`label2id` mappings derived from the dataset.
5. Using `DataCollatorForTokenClassification` for batching.
6. Implementing a `compute_metrics` function using `seqeval` (precision, recall, F1) for evaluation during training.
7. Running a hyperparameter search over 36 combinations, saving checkpoints and logs for each run.
8. Selecting the best model by the highest F1 score on the validation set (`pioner-silver/dev.conll03`).
9. Evaluating the best model on the test set (`pioner-gold/test.conll03`).
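
Step 2 above (subword label alignment) can be sketched in pure Python; here `word_ids` stands in for the list a fast tokenizer returns from `word_ids()`, and `-100` is the index PyTorch's cross-entropy loss ignores:

```python
def align_labels_with_tokens(word_ids, word_labels):
    """Give each subword token a label: the word's tag for its first
    subword, -100 (ignored by the loss) for special tokens and for
    continuation subwords."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:              # special token (<s>, </s>, padding)
            aligned.append(-100)
        elif word_id != previous_word:   # first subword of a new word
            aligned.append(word_labels[word_id])
        else:                            # continuation subword
            aligned.append(-100)
        previous_word = word_id
    return aligned

# Hypothetical subword split of a two-word name: <s> Գրիգ որ Նարեկացին </s>
word_ids = [None, 0, 0, 1, None]
word_labels = [1, 2]  # e.g. label ids for B-PER, I-PER
print(align_labels_with_tokens(word_ids, word_labels))  # [-100, 1, -100, 2, -100]
```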

(See `scripts/03_ner/ner_roberta.py` in the [GitHub repo](https://github.com/daviddallakyan2005/armenian-ner-network.git) for the full training script.)
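
The `seqeval` metrics used in training score complete entities rather than individual tokens: a prediction counts only if both the type and the full `B-`/`I-` span match. A simplified pure-Python illustration of that entity-level F1 (the actual script uses `seqeval`):

```python
def extract_entities(tags):
    """Collect (type, start, end_exclusive) spans from a BIO tag sequence."""
    entities = []
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, start = tags[i][2:], i
            i += 1
            while i < len(tags) and tags[i] == f"I-{etype}":
                i += 1
            entities.append((etype, start, i))
        else:
            i += 1
    return entities

def entity_f1(true_tags, pred_tags):
    """F1 over exact (type, span) matches, seqeval-style but simplified."""
    gold = set(extract_entities(true_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(entity_f1(gold, pred))  # 0.5: one of the two entities matched exactly
```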

## Evaluation

This model (`run_16`) achieved the best F1 score on the pioNER validation set during the hyperparameter search. Final evaluation metrics on the pioNER gold test set are logged in the training artifacts of the associated GitHub project.

## Citation

If you use this model or the associated code, please cite the GitHub repository:

```bibtex
@misc{armenian_ner_network_2025,
  author       = {David Dallakyan},
  title        = {Armenian NER and Network Analysis Project},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/daviddallakyan2005/armenian-ner-network}}
}
```

Please also cite the original XLM-RoBERTa paper and the pioNER dataset creators where applicable.