# ota-roberta-base-ner (Ottoman Turkish NER)
This model is a Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-roberta-base.
It recognizes PERSON, LOCATION, ORGANIZATION, and MISC entities in Ottoman Turkish texts.
## Model Details
- Developed by: Enes Yılandiloğlu
- Model type: Token classification (NER)
- Language(s): Ottoman Turkish (ota)
- License: cc-by-nc-4.0
- Finetuned from: enesyila/ota-roberta-base
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-roberta-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base-ner")
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")

text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
```
Expected output:

```python
[{'entity_group': 'PER', 'score': 0.9800526, 'word': 'Aḥmed Paşa', 'start': 0, 'end': 10},
 {'entity_group': 'LOC', 'score': 0.95372033, 'word': 'Edrine', 'start': 21, 'end': 27},
 {'entity_group': 'ORG', 'score': 0.8995747, 'word': 'Meḥmed Efendi Medresesinden', 'start': 32, 'end': 59},
 {'entity_group': 'PER', 'score': 0.9849827, 'word': 'Meḥmed Efendi', 'start': 60, 'end': 73}]
```
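Each span in the pipeline output carries an aggregate confidence score, so low-confidence predictions can be filtered before downstream use. A minimal sketch (the `0.9` threshold and the `filter_entities` helper are illustrative choices, not part of this model):

```python
def filter_entities(entities, min_score=0.9):
    """Keep only entity spans whose aggregate confidence meets the threshold."""
    return [e for e in entities if e["score"] >= min_score]

# Two of the spans from the example output above (scores abbreviated):
entities = [
    {"entity_group": "PER", "score": 0.9801, "word": "Aḥmed Paşa", "start": 0, "end": 10},
    {"entity_group": "ORG", "score": 0.8996, "word": "Meḥmed Efendi Medresesinden", "start": 32, "end": 59},
]
print(filter_entities(entities))  # only the PER span survives the 0.9 cutoff
```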
## Training Procedure
- Loss: Cross-entropy loss
- Batch size: 16 (train), 16 (eval)
- Optimizer: AdamW
- Learning rate: 3e-5
- Learning rate scheduler: Linear
- Warmup ratio: 0.01
- Epochs: 10 (early stopping enabled)
- Gradient checkpointing: Enabled
- Mixed precision: Enabled (fp16)
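The hyperparameters listed above map directly onto Hugging Face `TrainingArguments`. The sketch below is one plausible configuration, not the author's actual training script; the output directory, evaluation/save cadence, and early-stopping patience are assumptions not stated in this card:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="ota-roberta-base-ner",   # assumed name
    per_device_train_batch_size=16,      # batch size: 16 (train)
    per_device_eval_batch_size=16,       # batch size: 16 (eval)
    learning_rate=3e-5,                  # learning rate: 3e-5
    lr_scheduler_type="linear",          # linear LR scheduler
    warmup_ratio=0.01,                   # warmup ratio: 0.01
    num_train_epochs=10,                 # epochs: 10
    gradient_checkpointing=True,         # gradient checkpointing: enabled
    fp16=True,                           # mixed precision: fp16
    # AdamW is the Trainer default optimizer, so no extra flag is needed.
    eval_strategy="epoch",               # `evaluation_strategy` in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for early stopping
)

# Early stopping is wired in via a Trainer callback; the patience value
# here is an assumption (the card only says early stopping was enabled):
callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]
```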
## Training Data
The model was fine-tuned on a manually annotated corpus drawn from five classical Ottoman Turkish works, in both prose and verse, transliterated in the IJMES transliteration alphabet. The corpus contains 6,992 NER spans with the labels PER, LOC, ORG, and MISC.

The following works were used as training data:
- Ḳıṣâṣ-i Enbiyâ (16th century)
- Zeyl-i Şakâʾik (17th century)
- Veḳâyiʿü'l-Fużala (1731)
- Neticetü'l-Fikriyye (18th century)
- Silkü'l-Leʾal-i ʿÂl-i Os̱mân (18th century)
Named entity distribution by dataset split (roughly 80/10/10):

| Split | LOC | MISC | ORG | PER | TOTAL |
|---|---|---|---|---|---|
| Train | 1313 | 609 | 813 | 2835 | 5570 |
| Dev | 147 | 68 | 133 | 365 | 713 |
| Test | 162 | 61 | 124 | 362 | 709 |
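Fine-tuning a RoBERTa token classifier on span annotations like these requires aligning word-level BIO labels to subword tokens: the first subword of a word keeps the word's label, while continuation subwords and special tokens are masked with `-100` so the cross-entropy loss ignores them. A minimal sketch of this standard alignment step (the subword split in the example is hypothetical; only the label names come from this card):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level BIO labels onto subword tokens, masking continuations."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:           # special tokens (<s>, </s>, padding)
            aligned.append(ignore_index)
        elif wid != prev:         # first subword of a new word keeps the label
            aligned.append(word_labels[wid])
        else:                     # continuation subword is ignored by the loss
            aligned.append(ignore_index)
        prev = wid
    return aligned

# "Aḥmed Paşanın" with a hypothetical subword split; word_ids is the
# mapping a fast tokenizer returns via BatchEncoding.word_ids():
word_ids = [None, 0, 0, 1, 1, 1, None]
labels = ["B-PER", "I-PER"]
print(align_labels(word_ids, labels))
# → [-100, 'B-PER', -100, 'I-PER', -100, -100, -100]
```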
## Evaluation Results

Span-level results on the test set:

| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| LOC | 0.8971 | 0.9385 | 0.9173 | 195 |
| MISC | 0.7317 | 0.7895 | 0.7595 | 76 |
| ORG | 0.9195 | 0.9195 | 0.9195 | 149 |
| PER | 0.9066 | 0.9278 | 0.9171 | 471 |
Span-level (micro avg):
- Precision: 0.8909
- Recall: 0.9169
- F1: 0.9038
Span-level (macro avg):
- Precision: 0.8637
- Recall: 0.8938
- F1: 0.8783
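The macro averages above are the unweighted mean of the per-label scores, while the micro averages pool true positives and predictions across labels before dividing. The macro figures can be sanity-checked directly from the table; a short sketch:

```python
# Per-label span-level scores, copied from the table above.
scores = {
    "LOC":  {"precision": 0.8971, "recall": 0.9385},
    "MISC": {"precision": 0.7317, "recall": 0.7895},
    "ORG":  {"precision": 0.9195, "recall": 0.9195},
    "PER":  {"precision": 0.9066, "recall": 0.9278},
}

def macro(metric):
    """Macro average: unweighted mean over the four labels."""
    return sum(s[metric] for s in scores.values()) / len(scores)

print(round(macro("precision"), 4))  # 0.8637, matching the macro avg above
print(round(macro("recall"), 4))     # 0.8938, matching the macro avg above
```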
Token-level results (excluding the “O” label):

| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| B-PER | 0.9470 | 0.9490 | 0.9480 | 471 |
| I-PER | 0.9574 | 0.9662 | 0.9618 | 2956 |
| B-LOC | 0.9254 | 0.9538 | 0.9394 | 195 |
| I-LOC | 0.9176 | 0.9125 | 0.9150 | 537 |
| B-ORG | 0.9589 | 0.9396 | 0.9492 | 149 |
| I-ORG | 0.9701 | 0.9602 | 0.9651 | 979 |
| B-MISC | 0.8800 | 0.8684 | 0.8742 | 76 |
| I-MISC | 0.9128 | 0.8738 | 0.8929 | 515 |
Token-level (macro avg, excl. “O”):
- Precision: 0.9336
- Recall: 0.9279
- F1: 0.9307
Token-level (weighted avg, excl. “O”):
- Precision: 0.9491
- Recall: 0.9495
- F1: 0.9493
## Model Card Author
Enes Yılandiloğlu