ota-roberta-base-ner (Ottoman Turkish NER)

This model is a Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-roberta-base.
It recognizes PERSON, LOCATION, ORGANIZATION, and MISC entities in Ottoman Turkish texts.


Model Details

  • Developed by: Enes Yılandiloğlu
  • Model type: Token classification (NER)
  • Language(s): Ottoman Turkish (ota)
  • License: cc-by-nc-4.0
  • Finetuned from: enesyila/ota-roberta-base

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-roberta-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base-ner")

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")

text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
    [{'entity_group': 'PER', 'score': 0.9800526, 'word': 'Aḥmed Paşa', 'start': 0, 'end': 10},
     {'entity_group': 'LOC', 'score': 0.95372033, 'word': 'Edrine', 'start': 21, 'end': 27},
     {'entity_group': 'ORG', 'score': 0.8995747, 'word': 'Meḥmed Efendi Medresesinden', 'start': 32, 'end': 59},
     {'entity_group': 'PER', 'score': 0.9849827, 'word': 'Meḥmed Efendi', 'start': 60, 'end': 73}]
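The aggregated pipeline output is a plain list of dictionaries, so it can be post-processed directly. As a minimal sketch (the entity list is copied from the example output above; the 0.9 threshold is an illustrative choice, not part of the model), here is how one might keep only high-confidence spans and group surface forms by label:

```python
# Example pipeline output, copied from the model card above.
entities = [
    {"entity_group": "PER", "score": 0.9800526, "word": "Aḥmed Paşa", "start": 0, "end": 10},
    {"entity_group": "LOC", "score": 0.95372033, "word": "Edrine", "start": 21, "end": 27},
    {"entity_group": "ORG", "score": 0.8995747, "word": "Meḥmed Efendi Medresesinden", "start": 32, "end": 59},
    {"entity_group": "PER", "score": 0.9849827, "word": "Meḥmed Efendi", "start": 60, "end": 73},
]

# Keep only spans at or above an (illustrative) confidence threshold,
# then collect the surface forms under their entity label.
THRESHOLD = 0.9
by_label = {}
for ent in entities:
    if ent["score"] >= THRESHOLD:
        by_label.setdefault(ent["entity_group"], []).append(ent["word"])

print(by_label)
# {'PER': ['Aḥmed Paşa', 'Meḥmed Efendi'], 'LOC': ['Edrine']}
```

Note that the ORG span falls below the 0.9 threshold in this example; in practice the threshold should be tuned per label, since ORG and MISC scores run lower than PER and LOC (see the evaluation results below).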

Training Procedure

  • Loss: Focal Loss (γ=2.0)
  • Batch size: 32 (train), 32 (eval)
  • Optimizer: AdamW
  • Learning rate: 2e-5
  • Epochs: 50 (early stopping enabled)
  • Gradient checkpointing: Enabled
  • Mixed precision: Enabled (fp16)
  • Note: non-entity sentences were added to each split (30% of the sentences) for regularization
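The card does not include the training code, but the focal loss it names is standard: per-token cross-entropy scaled by (1 − p)^γ, which down-weights easy, confident tokens (mostly the majority "O" class) so that rarer entity tokens dominate the gradient. A minimal single-token sketch, assuming γ=2.0 as above:

```python
import math

def focal_term(p_true, gamma=2.0):
    """Focal loss for one token: (1 - p)^gamma * cross-entropy.
    p_true is the predicted probability of the gold label.
    With gamma=0 this reduces to ordinary cross-entropy."""
    return (1.0 - p_true) ** gamma * (-math.log(p_true))

easy = focal_term(0.95)  # confidently correct token: loss nearly vanishes
hard = focal_term(0.40)  # uncertain token: loss stays close to cross-entropy
print(easy, hard)
```

With γ=2.0, a token predicted at p=0.95 contributes roughly 2500× less loss than plain cross-entropy would give a p=0.40 token, which is why focal loss suits the skewed label distribution shown in the tables below.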

Training Data

The model was fine-tuned on a manually annotated corpus of 10 classical Ottoman Turkish works, in both prose and verse, transliterated with the IJMES alphabet and comprising ~10k NER spans with the labels PER, LOC, ORG, and MISC. The following works were used as training data:

  • Reşeḥât-ı Muḥyî (1503)
  • Mecmaʿu'l-Eşrâf (1557)
  • Zübdeti't-Tevâriḫ (1585?)
  • Muḥibbî Dîvânı (16th century)
  • Taşlıcalı Yaḥyâ Dîvânı (16th century)
  • Zâtî Dîvânı (16th century)
  • Künhü'l-Aḫbâr (1600?)
  • Ḥadâʾiḳu'l-ḥaḳâʾiḳ Fî Tekmileti'ş-şaḳâʾiḳ (1632-1633)
  • Veḳâyiʿü'l-Fużala (1731)
  • Tekmiletü'ş-Şaḳâʾiḳ fî hakki ehli'l-hakâʾik (1896-97)

Named entity distribution by dataset split (roughly 80/10/10):

Split  LOC   MISC  ORG  PER   TOTAL
Train  2780  866   695  4023  8364
Dev    351   121   83   551   1106
Test   388   116   93   509   1106

Evaluation Results

Span-level results on test set:

Label  Precision  Recall  F1     Support
LOC    0.710      0.779   0.743  340
MISC   0.505      0.525   0.515  99
ORG    0.586      0.554   0.569  74
PER    0.751      0.808   0.779  485

Span-level (micro):

  • Precision: 0.702
  • Recall: 0.752
  • F1: 0.726

Token-level Results

Label   Precision  Recall  F1     Support
B-LOC   0.774      0.815   0.794  340
B-MISC  0.659      0.566   0.609  99
B-ORG   0.646      0.575   0.609  73
B-PER   0.818      0.852   0.834  485
I-LOC   0.731      0.800   0.764  960
I-MISC  0.714      0.608   0.657  521
I-ORG   0.730      0.669   0.698  387
I-PER   0.850      0.894   0.871  2470

Token-level (overall):

  • Precision: 0.795
  • Recall: 0.814
  • F1: 0.804
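The aggregate figures above can be cross-checked from the per-label numbers in the token-level table; a quick sanity check in plain Python (all values copied from the card):

```python
# Token-level per-label F1 scores from the table above.
per_label_f1 = {
    "B-LOC": 0.794, "B-MISC": 0.609, "B-ORG": 0.609, "B-PER": 0.834,
    "I-LOC": 0.764, "I-MISC": 0.657, "I-ORG": 0.698, "I-PER": 0.871,
}

# Macro F1 is the unweighted mean of the per-label F1 scores.
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

# Micro F1 is the harmonic mean of micro precision and recall.
micro_p, micro_r = 0.795, 0.814
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(macro_f1, micro_f1)
```

Both reproduce the reported aggregates to three decimals (macro ≈ 0.729, micro ≈ 0.804), confirming the table and the summary bullets are consistent.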
  • Macro F1: 0.729

Model Card Author

Enes Yılandiloğlu

Model Card Contact

[email protected]
