ota-roberta-base-ner (Ottoman Turkish NER)

This model is a Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-roberta-base.
It recognizes PERSON, LOCATION, ORGANIZATION, and MISC entities in Ottoman Turkish texts.


Model Details

  • Developed by: Enes Yılandiloğlu
  • Model type: Token classification (NER)
  • Language(s): Ottoman Turkish (ota)
  • License: cc-by-nc-4.0
  • Finetuned from: enesyila/ota-roberta-base

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-roberta-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base-ner")

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")

text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
    [{'entity_group': 'PER', 'score': 0.9800526, 'word': 'Aḥmed Paşa', 'start': 0, 'end': 10},
     {'entity_group': 'LOC', 'score': 0.95372033, 'word': 'Edrine', 'start': 21, 'end': 27},
     {'entity_group': 'ORG', 'score': 0.8995747, 'word': 'Meḥmed Efendi Medresesinden', 'start': 32, 'end': 59},
     {'entity_group': 'PER', 'score': 0.9849827, 'word': 'Meḥmed Efendi', 'start': 60, 'end': 73}]
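The aggregated pipeline output is a plain list of dictionaries, so it can be post-processed directly. As a minimal sketch (the entity list is copied from the example output above; the 0.9 threshold is an illustrative choice, not part of the model), here is how one might keep only high-confidence spans and group surface forms by label:

```python
# Example pipeline output, copied from the model card above.
entities = [
    {"entity_group": "PER", "score": 0.9800526, "word": "Aḥmed Paşa", "start": 0, "end": 10},
    {"entity_group": "LOC", "score": 0.95372033, "word": "Edrine", "start": 21, "end": 27},
    {"entity_group": "ORG", "score": 0.8995747, "word": "Meḥmed Efendi Medresesinden", "start": 32, "end": 59},
    {"entity_group": "PER", "score": 0.9849827, "word": "Meḥmed Efendi", "start": 60, "end": 73},
]

# Keep only spans at or above an (illustrative) confidence threshold,
# then collect the surface forms under their entity label.
THRESHOLD = 0.9
by_label = {}
for ent in entities:
    if ent["score"] >= THRESHOLD:
        by_label.setdefault(ent["entity_group"], []).append(ent["word"])

print(by_label)
# {'PER': ['Aḥmed Paşa', 'Meḥmed Efendi'], 'LOC': ['Edrine']}
```

Note that the ORG span falls below the 0.9 threshold in this example; in practice the threshold should be tuned per label, since ORG and MISC scores run lower than PER and LOC (see the evaluation results below).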

Training Procedure

  • Loss: Focal Loss (γ=2.0)
  • Batch size: 32 (train), 32 (eval)
  • Optimizer: AdamW
  • Learning rate: 2e-5
  • Epochs: 50 (early stopping enabled)
  • Gradient checkpointing: Enabled
  • Mixed precision: Enabled (fp16)
  • Note: non-entity sentences were added to each split (30% of the sentences) for regularization
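The card does not include the training code, but the focal loss it names is standard: per-token cross-entropy scaled by (1 − p)^γ, which down-weights easy, confident tokens (mostly the majority "O" class) so that rarer entity tokens dominate the gradient. A minimal single-token sketch, assuming γ=2.0 as above:

```python
import math

def focal_term(p_true, gamma=2.0):
    """Focal loss for one token: (1 - p)^gamma * cross-entropy.
    p_true is the predicted probability of the gold label.
    With gamma=0 this reduces to ordinary cross-entropy."""
    return (1.0 - p_true) ** gamma * (-math.log(p_true))

easy = focal_term(0.95)  # confidently correct token: loss nearly vanishes
hard = focal_term(0.40)  # uncertain token: loss stays close to cross-entropy
print(easy, hard)
```

With γ=2.0, a token predicted at p=0.95 contributes roughly 2500× less loss than plain cross-entropy would give a p=0.40 token, which is why focal loss suits the skewed label distribution shown in the tables below.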

Training Data

The model was fine-tuned on a manually annotated corpus of 10 classical Ottoman Turkish works, in both prose and verse, transliterated with the IJMES alphabet and comprising ~10k NER spans with the labels PER, LOC, ORG, and MISC. The following works were used as training data:

  • Reşeḥât-ı Muḥyî (1503)
  • Mecmaʿu'l-Eşrâf (1557)
  • Zübdeti't-Tevâriḫ (1585?)
  • Muḥibbî Dîvânı (16th century)
  • Taşlıcalı Yaḥyâ Dîvânı (16th century)
  • Zâtî Dîvânı (16th century)
  • Künhü'l-Aḫbâr (1600?)
  • Ḥadâʾiḳu'l-ḥaḳâʾiḳ Fî Tekmileti'ş-şaḳâʾiḳ (1632-1633)
  • Veḳâyiʿü'l-Fużala (1731)
  • Tekmiletü'ş-Şaḳâʾiḳ fî hakki ehli'l-hakâʾik (1896-97)

Named entity distribution by dataset split (roughly 80/10/10):

Split  LOC   MISC  ORG  PER   TOTAL
Train  2780  866   695  4023  8364
Dev    351   121   83   551   1106
Test   388   116   93   509   1106

Evaluation Results

Span-level results on test set:

Label  Precision  Recall  F1     Support
LOC    0.710      0.779   0.743  340
MISC   0.505      0.525   0.515  99
ORG    0.586      0.554   0.569  74
PER    0.751      0.808   0.779  485

Span-level (micro):

  • Precision: 0.702
  • Recall: 0.752
  • F1: 0.726

Token-level Results

Label   Precision  Recall  F1     Support
B-LOC   0.774      0.815   0.794  340
B-MISC  0.659      0.566   0.609  99
B-ORG   0.646      0.575   0.609  73
B-PER   0.818      0.852   0.834  485
I-LOC   0.731      0.800   0.764  960
I-MISC  0.714      0.608   0.657  521
I-ORG   0.730      0.669   0.698  387
I-PER   0.850      0.894   0.871  2470

Token-level (overall):

  • Precision: 0.795
  • Recall: 0.814
  • F1: 0.804
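The aggregate figures above can be cross-checked from the per-label numbers in the token-level table; a quick sanity check in plain Python (all values copied from the card):

```python
# Token-level per-label F1 scores from the table above.
per_label_f1 = {
    "B-LOC": 0.794, "B-MISC": 0.609, "B-ORG": 0.609, "B-PER": 0.834,
    "I-LOC": 0.764, "I-MISC": 0.657, "I-ORG": 0.698, "I-PER": 0.871,
}

# Macro F1 is the unweighted mean of the per-label F1 scores.
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

# Micro F1 is the harmonic mean of micro precision and recall.
micro_p, micro_r = 0.795, 0.814
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(macro_f1, micro_f1)
```

Both reproduce the reported aggregates to three decimals (macro ≈ 0.729, micro ≈ 0.804), confirming the table and the summary bullets are consistent.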
  • Macro F1: 0.729

Model Card Author

Enes Yılandiloğlu

Model Card Contact

[email protected]
