ota-roberta-base-ner (Ottoman Turkish NER)

This model is a Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-roberta-base.
It recognizes PERSON, LOCATION, ORGANIZATION, and MISC entities in Ottoman Turkish texts.


Model Details

  • Developed by: Enes Yılandiloğlu
  • Model type: Token classification (NER)
  • Language(s): Ottoman Turkish (ota)
  • License: cc-by-nc-4.0
  • Finetuned from: enesyila/ota-roberta-base

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-roberta-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base-ner")

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")

text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
# Output:
# [{'entity_group': 'PER', 'score': 0.9800526, 'word': 'Aḥmed Paşa', 'start': 0, 'end': 10},
#  {'entity_group': 'LOC', 'score': 0.95372033, 'word': 'Edrine', 'start': 21, 'end': 27},
#  {'entity_group': 'ORG', 'score': 0.8995747, 'word': 'Meḥmed Efendi Medresesinden', 'start': 32, 'end': 59},
#  {'entity_group': 'PER', 'score': 0.9849827, 'word': 'Meḥmed Efendi', 'start': 60, 'end': 73}]
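The pipeline returns one dict per aggregated span. A small helper can bucket those spans by entity type and drop low-confidence predictions; this sketch reuses the example output above as its input (the `min_score` threshold is an illustrative choice, not part of the model):

```python
from collections import defaultdict

# Example pipeline output (copied from the quickstart above)
entities = [
    {'entity_group': 'PER', 'score': 0.9800526, 'word': 'Aḥmed Paşa', 'start': 0, 'end': 10},
    {'entity_group': 'LOC', 'score': 0.95372033, 'word': 'Edrine', 'start': 21, 'end': 27},
    {'entity_group': 'ORG', 'score': 0.8995747, 'word': 'Meḥmed Efendi Medresesinden', 'start': 32, 'end': 59},
    {'entity_group': 'PER', 'score': 0.9849827, 'word': 'Meḥmed Efendi', 'start': 60, 'end': 73},
]

def group_by_type(ents, min_score=0.5):
    """Bucket predicted spans by entity type, dropping low-confidence ones."""
    grouped = defaultdict(list)
    for ent in ents:
        if ent['score'] >= min_score:
            grouped[ent['entity_group']].append(ent['word'])
    return dict(grouped)

print(group_by_type(entities))
# {'PER': ['Aḥmed Paşa', 'Meḥmed Efendi'], 'LOC': ['Edrine'], 'ORG': ['Meḥmed Efendi Medresesinden']}
```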

Training Procedure

  • Loss: Cross-entropy loss
  • Batch size: 16 (train), 16 (eval)
  • Optimizer: AdamW
  • Learning rate: 3e-5
  • Learning rate scheduler: Linear
  • Warmup ratio: 0.01
  • Epochs: 10 (early stopping enabled)
  • Gradient checkpointing: Enabled
  • Mixed precision: Enabled (fp16)
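The hyperparameters above map onto a `transformers` Trainer configuration roughly as follows. This is a sketch, not the actual training script: `train_ds` and `eval_ds` are placeholders for a tokenized, IOB2-labelled dataset, and the early-stopping patience value is a guess.

```python
from transformers import (AutoModelForTokenClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

args = TrainingArguments(
    output_dir="ota-roberta-base-ner",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="epoch",        # named "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

model = AutoModelForTokenClassification.from_pretrained(
    "enesyila/ota-roberta-base", num_labels=9)  # 9 = O + B/I tags for 4 entity types

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder
    eval_dataset=eval_ds,    # placeholder
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
# trainer.train()
```

AdamW is the Trainer's default optimizer and cross-entropy is the default token-classification loss, so neither needs to be configured explicitly.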

Training Data

The model was fine-tuned on a manually annotated corpus of five classical Ottoman Turkish works, in both prose and verse, transliterated with the IJMES transliteration alphabet. The corpus contains 6,992 NER spans labeled PER, LOC, ORG, and MISC. The following works were used as training data:

  • Ḳıṣâṣ-i Enbiyâ (16th century)
  • Zeyl-i Şakâʾik (17th century)
  • Veḳâyiʿü'l-Fużala (1731)
  • Neticetü'l-Fikriyye (18th century)
  • Silkü'l-Leʾal-i ʿÂl-i Os̱mân (18th century)

Named entity distribution by dataset split (roughly 80/10/10):

  Split   LOC    MISC   ORG    PER    TOTAL
  Train   1313   609    813    2835   5570
  Dev     147    68     133    365    713
  Test    162    61     124    362    709
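The PER/LOC/ORG/MISC labels are encoded with the IOB2 tagging scheme (visible in the token-level results below). A minimal sketch of the resulting label inventory; the exact label-to-id ordering here is an assumption, not taken from the model's config:

```python
# IOB2 tag set for the four entity types (O = outside any entity)
ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

print(LABELS)
# ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-MISC', 'I-MISC']
```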

Evaluation Results

Span-level results on the test set:

  Label   Precision   Recall   F1-score   Support
  LOC     0.8971      0.9385   0.9173     195
  MISC    0.7317      0.7895   0.7595     76
  ORG     0.9195      0.9195   0.9195     149
  PER     0.9066      0.9278   0.9171     471

Span-level (micro avg):

  • Precision: 0.8909
  • Recall: 0.9169
  • F1: 0.9038

Span-level (macro avg):

  • Precision: 0.8637
  • Recall: 0.8938
  • F1: 0.8783
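Span-level scoring counts a prediction as correct only when both the entity boundaries and the type match the gold annotation exactly. A minimal, self-contained sketch of that metric on toy data (illustrative only; this is not the evaluation code actually used, and libraries such as seqeval are the usual choice):

```python
def iob2_spans(tags):
    """Extract (type, start, end) spans from an IOB2 tag sequence (end exclusive)."""
    spans, etype, start = [], None, None
    for i, tag in enumerate(tags):
        # Close an open span on "O", on a new "B-", or on a type change.
        if etype is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != etype):
            spans.append((etype, start, i))
            etype = None
        if tag.startswith("B-"):
            etype, start = tag[2:], i
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans

def span_f1(gold_seqs, pred_seqs):
    """Micro-averaged span-level F1: exact boundary and type match."""
    gold = {(k,) + s for k, seq in enumerate(gold_seqs) for s in iob2_spans(seq)}
    pred = {(k,) + s for k, seq in enumerate(pred_seqs) for s in iob2_spans(seq)}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy gold/predicted sequences (not the model's test set)
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "O"]]

print(round(span_f1(y_true, y_pred), 4))
# 0.6667 — the truncated ORG span counts as both a false positive and a false negative
```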

Token-level results on the test set (excluding the “O” label):

  Label    Precision   Recall   F1-score   Support
  B-PER    0.9470      0.9490   0.9480     471
  I-PER    0.9574      0.9662   0.9618     2956
  B-LOC    0.9254      0.9538   0.9394     195
  I-LOC    0.9176      0.9125   0.9150     537
  B-ORG    0.9589      0.9396   0.9492     149
  I-ORG    0.9701      0.9602   0.9651     979
  B-MISC   0.8800      0.8684   0.8742     76
  I-MISC   0.9128      0.8738   0.8929     515

Token-level (macro avg, excl. “O”):

  • Precision: 0.9336
  • Recall: 0.9279
  • F1: 0.9307

Token-level (weighted avg, excl. “O”):

  • Precision: 0.9491
  • Recall: 0.9495
  • F1: 0.9493
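As a sanity check, the two averaging schemes can be recomputed from the per-class token-level table above: macro weights every class equally, weighted scales each class by its support. The weighted figure drifts slightly from the reported one because the per-class scores in the table are rounded to four decimals.

```python
# Per-class token-level F1 and support, copied from the table above
f1 = {"B-PER": 0.9480, "I-PER": 0.9618, "B-LOC": 0.9394, "I-LOC": 0.9150,
      "B-ORG": 0.9492, "I-ORG": 0.9651, "B-MISC": 0.8742, "I-MISC": 0.8929}
support = {"B-PER": 471, "I-PER": 2956, "B-LOC": 195, "I-LOC": 537,
           "B-ORG": 149, "I-ORG": 979, "B-MISC": 76, "I-MISC": 515}

macro = sum(f1.values()) / len(f1)                       # unweighted mean over classes
weighted = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro, 4))     # 0.9307, matching the reported macro F1
print(round(weighted, 4))  # ~0.9487, near the reported 0.9493 (gap from rounding)
```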

Model Card Author

Enes Yılandiloğlu

Model Card Contact

[email protected]

Model size: 277M parameters (F32, Safetensors)