ota-roberta-base-ner (Ottoman Turkish NER)
This model is a Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-roberta-base.
It recognizes PERSON, LOCATION, ORGANIZATION, and MISC entities in Ottoman Turkish texts.
Model Details
- Developed by: Enes Yılandiloğlu
- Model type: Token classification (NER)
- Language(s): Ottoman Turkish (ota)
- License: cc-by-nc-4.0
- Finetuned from: enesyila/ota-roberta-base
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-roberta-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base-ner")
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")
text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
Output:
[{'entity_group': 'PER','score': 0.9800526,'word': 'Aḥmed Paşa','start': 0,'end': 10},
{'entity_group': 'LOC','score': 0.95372033,'word': 'Edrine','start': 21,'end': 27},
{'entity_group': 'ORG','score': 0.8995747,'word': 'Meḥmed Efendi Medresesinden','start': 32,'end': 59},
{'entity_group': 'PER','score': 0.9849827,'word': 'Meḥmed Efendi','start': 60,'end': 73}]
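The pipeline returns a list of dictionaries, one per aggregated entity. As a minimal sketch of post-processing, the snippet below groups the predicted entities by label and drops low-confidence ones. The helper name `group_entities` and the 0.9 threshold are illustrative assumptions, not part of the model or the `transformers` API; the sample data is copied from the output above so the snippet runs without downloading the model.

```python
from collections import defaultdict

# Sample pipeline output, copied from the example above.
entities = [
    {"entity_group": "PER", "score": 0.9800526, "word": "Aḥmed Paşa", "start": 0, "end": 10},
    {"entity_group": "LOC", "score": 0.95372033, "word": "Edrine", "start": 21, "end": 27},
    {"entity_group": "ORG", "score": 0.8995747, "word": "Meḥmed Efendi Medresesinden", "start": 32, "end": 59},
    {"entity_group": "PER", "score": 0.9849827, "word": "Meḥmed Efendi", "start": 60, "end": 73},
]


def group_entities(entities, min_score=0.9):
    """Group entity strings by label, keeping predictions scored >= min_score.

    Hypothetical helper; the threshold is an arbitrary illustrative choice.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


result = group_entities(entities)
print(result)
# {'PER': ['Aḥmed Paşa', 'Meḥmed Efendi'], 'LOC': ['Edrine']}
```

With this threshold the ORG prediction (score of about 0.90) is filtered out; lowering `min_score` keeps it.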
Training Procedure
- Loss: Focal Loss (γ=2.0)
- Batch size: 32 (train), 32 (eval)
- Optimizer: AdamW
- Learning rate: 2e-5
- Epochs: 50 (early stopping enabled)
- Gradient checkpointing: Enabled
- Mixed precision: Enabled (fp16)
- Note: Non-entity sentences were added to 30% of each split for regularization
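For readers unfamiliar with focal loss, the sketch below shows the standard per-token formulation, FL = -(1 - p_t)^γ · log(p_t), with γ = 2.0 as listed above. It is written in plain Python for illustration only. The model card does not describe the actual training implementation (for example, whether class weighting was used), so the function names and example logits here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one token: -(1 - p_t)**gamma * log(p_t).

    p_t is the predicted probability of the gold class `target`.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Illustrative logits (made up): an "easy" token the model already gets right,
# and a "hard" token where the gold class (index 0) has low probability.
easy, hard = [4.0, 0.0, 0.0], [0.0, 1.0, 0.0]

for name, logits in [("easy", easy), ("hard", hard)]:
    ce = focal_loss(logits, target=0, gamma=0.0)  # plain cross-entropy
    fl = focal_loss(logits, target=0, gamma=2.0)
    print(f"{name}: cross-entropy={ce:.4f}, focal={fl:.4f}")
```

The factor (1 - p_t)^γ shrinks the loss of well-classified tokens far more than that of misclassified ones, which is why focal loss is often chosen for imbalanced tagging tasks where most tokens are `O`.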
Training Data
The model was fine-tuned on a manually annotated corpus of 10 classical Ottoman Turkish works, in both prose and verse, transliterated with the IJMES alphabet. The corpus contains ~10k NER spans labeled PER, LOC, ORG, or MISC.
The following works were used as training data:
- Reşeḥât-ı Muḥyî (1503)
- Mecmaʿu'l-Eşrâf (1557)
- Zübdeti't-Tevâriḫ (1585?)
- Muḥibbî Dîvânı (16th century)
- Taşlıcalı Yaḥyâ Dîvânı (16th century)
- Zâtî Dîvânı (16th century)
- Künhü'l-Aḫbâr (1600?)
- Ḥadâʾiḳu'l-ḥaḳâʾiḳ Fî Tekmileti'ş-şaḳâʾiḳ (1632-1633)
- Veḳâyiʿü'l-Fużala (1731)
- Tekmiletü'ş-Şaḳâʾiḳ fî hakki ehli'l-hakâʾik (1896-97)
Named entity distribution by dataset split (roughly 80/10/10):
Split | LOC | MISC | ORG | PER | TOTAL |
---|---|---|---|---|---|
Train | 2780 | 866 | 695 | 4023 | 8364 |
Dev | 351 | 121 | 83 | 551 | 1106 |
Test | 388 | 116 | 93 | 509 | 1106 |
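As a quick arithmetic check of the "roughly 80/10/10" split, the following sketch recomputes the per-split totals and their shares from the counts in the table above. Only the numbers come from the table; the variable names are illustrative.

```python
# Span counts per split, copied from the table above.
splits = {
    "Train": {"LOC": 2780, "MISC": 866, "ORG": 695, "PER": 4023},
    "Dev": {"LOC": 351, "MISC": 121, "ORG": 83, "PER": 551},
    "Test": {"LOC": 388, "MISC": 116, "ORG": 93, "PER": 509},
}

# Total spans per split and across the whole corpus.
totals = {name: sum(counts.values()) for name, counts in splits.items()}
grand_total = sum(totals.values())

# Share of all spans that falls into each split.
shares = {name: total / grand_total for name, total in totals.items()}

for name in splits:
    print(f"{name}: {totals[name]} spans ({shares[name]:.1%})")
print(f"All splits: {grand_total} spans")
# Train: 8364 spans (79.1%)
# Dev: 1106 spans (10.5%)
# Test: 1106 spans (10.5%)
# All splits: 10576 spans
```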
Evaluation Results
Span-level results on the test set:
Label | Precision | Recall | F1 | Support |
---|---|---|---|---|
LOC | 0.710 | 0.779 | 0.743 | 340 |
MISC | 0.505 | 0.525 | 0.515 | 99 |
ORG | 0.586 | 0.554 | 0.569 | 74 |
PER | 0.751 | 0.808 | 0.779 | 485 |
Span-level (micro):
- Precision: 0.702
- Recall: 0.752
- F1: 0.726
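Span-level scores are usually stricter than token-level ones because a predicted entity counts as correct only if its label and both boundaries match the gold annotation. The sketch below illustrates that idea on BIO tags in plain Python. It assumes exact-match scoring; the model card does not state which evaluation tool was used, and the function names and toy tag sequences are illustrative.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in a BIO sequence (end exclusive)."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        # An open span ends at "O", at any "B-", or at an "I-" with a different label.
        if label is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Stray "I-" with no open span: treated as a span start in this sketch.
            start, label = i, tag[2:]
    return spans


def span_prf(gold_tags, pred_tags):
    """Exact-match span precision, recall, and F1 for one tag sequence."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the PER span matches, but the LOC span is one token too long.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy case four of the five tokens are tagged correctly, yet only one of the two predicted spans counts as a match.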
Token-level Results
Label | Precision | Recall | F1 | Support |
---|---|---|---|---|
B-LOC | 0.774 | 0.815 | 0.794 | 340 |
B-MISC | 0.659 | 0.566 | 0.609 | 99 |
B-ORG | 0.646 | 0.575 | 0.609 | 73 |
B-PER | 0.818 | 0.852 | 0.834 | 485 |
I-LOC | 0.731 | 0.800 | 0.764 | 960 |
I-MISC | 0.714 | 0.608 | 0.657 | 521 |
I-ORG | 0.730 | 0.669 | 0.698 | 387 |
I-PER | 0.850 | 0.894 | 0.871 | 2470 |
Token-level (micro, except where noted):
- Precision: 0.795
- Recall: 0.814
- F1: 0.804
- Macro F1: 0.729
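The aggregate token-level figures can be sanity-checked from the reported numbers: F1 is the harmonic mean of precision and recall, and macro F1 is the unweighted mean of the per-label F1 scores. The sketch below does both using only values from the tables above; small differences are expected because the inputs are already rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-label token-level F1 scores, copied from the table above.
token_f1 = {
    "B-LOC": 0.794, "B-MISC": 0.609, "B-ORG": 0.609, "B-PER": 0.834,
    "I-LOC": 0.764, "I-MISC": 0.657, "I-ORG": 0.698, "I-PER": 0.871,
}

# Macro F1: every label contributes equally, regardless of its support.
macro_f1 = sum(token_f1.values()) / len(token_f1)

# Micro F1 recomputed from the reported micro precision and recall.
micro_f1 = f1_score(0.795, 0.814)

print(f"macro F1 ~ {macro_f1:.4f}, micro F1 ~ {micro_f1:.4f}")
```

Both values agree with the reported 0.729 and 0.804 to within rounding.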
Model Card Author
Enes Yılandiloğlu
Model Card Contact