Ottoman-NER Latin (fatihburakkaragoz/ottoman-ner-latin)
This model performs Named Entity Recognition (NER) on Ottoman Turkish texts transliterated into Latin script. It identifies persons (PER), locations (LOC), and organizations (ORG) in historical Ottoman Turkish texts. The model is fine-tuned from dbmdz/bert-base-turkish-cased on a custom CoNLL-formatted dataset annotated specifically for Ottoman Turkish in Latin script.
Part of the Ottoman-NLP Project
While this specific model (ottoman-ner-latin) was developed independently, it builds on the broader Ottoman-NLP research direction initiated at the Boğaziçi University BUCOLIN Lab under the guidance of Prof. Dr. Şaziye Betül Özateş.
Explore more from the lab at: huggingface.co/BUCOLIN
Main project repository: github.com/Ottoman-NLP
Model Details
- Model Type: Token Classification (NER)
- Base Model: dbmdz/bert-base-turkish-cased
- Language: Ottoman Turkish (Latin transliteration)
- Training Dataset: Custom manually annotated dataset
- Fine-tuned by: Fatih Burak Karagöz
- License: MIT
- Intended Use: Historical Turkish NLP, research, digital humanities
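The description above mentions a CoNLL-formatted training dataset. The card does not show the file layout, so the following is only a minimal sketch under common CoNLL conventions: one token per line with its tag in the last whitespace-separated column, and a blank line between sentences. The function name `read_conll`, the sample lines, and the BIO-style tags are assumptions made for illustration, not taken from the actual dataset.

```python
def read_conll(lines):
    """Parse token/tag lines into a list of (tokens, tags) sentence pairs.

    Assumes one token per line, the tag in the last column, and a blank
    line between sentences (a common CoNLL layout, not confirmed by the card).
    """
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        # Flush the last sentence when the input does not end with a blank line.
        sentences.append((tokens, tags))
    return sentences


# Hand-written sample, not an excerpt from the real training data.
sample = """Emin B-PER
Bey I-PER
oynuyor O

Tepebaşı B-LOC
""".splitlines()

for sentence_tokens, sentence_tags in read_conll(sample):
    print(list(zip(sentence_tokens, sentence_tags)))
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.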
Usage
Transformers pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_id = "fatihburakkaragoz/ottoman-ner-latin"
model = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# aggregation_strategy="simple" merges word pieces into whole entity spans
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Emin Bey’in kuklaları bir haftadır Tepebaşı’nda oynuyor."
entities = ner_pipeline(text)
print(entities)
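The pipeline returns a list of dictionaries, one per entity span. As a hedged sketch of a typical post-processing step, the snippet below groups spans by label and drops low-confidence ones. The helper `group_entities`, the `min_score` threshold, and the sample list (including the low-scoring ORG entry) are hypothetical; the sample is hand-written to mimic the pipeline's output shape and was not produced by running the model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group pipeline results by label, keeping spans that score at least min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (illustrative values only).
sample = [
    {"entity_group": "PER", "word": "Emin Bey", "score": 0.99},
    {"entity_group": "LOC", "word": "Tepebaşı", "score": 0.98},
    {"entity_group": "ORG", "word": "Example", "score": 0.40},  # hypothetical weak span
]

print(group_entities(sample, min_score=0.5))
```

With real output, `group_entities(ner_pipeline(text), min_score=0.5)` would work the same way; the threshold is a choice to tune per corpus rather than a recommended value.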
Training & Evaluation
- Training Script: scripts/train_latin_ner.py
- Labels: PER (Person), LOC (Location), ORG (Organization), O
- Metrics (seqeval):
  - Precision: e.g., 89.4%
  - Recall: e.g., 87.1%
  - F1 Score: e.g., 88.2%
- Epochs: 5
- Batch Size: 8
- Hardware: RTX 4090, CUDA 12.8
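The metrics above are reported with seqeval, which scores whole entity spans rather than individual tokens. To make that concrete, here is a small pure-Python sketch of micro-averaged, entity-level precision, recall, and F1. It assumes BIO-style tags (for example B-PER, I-PER), which the card does not state explicitly, and it is an illustration of the idea rather than a reimplementation of seqeval or of the training script. The names `extract_spans` and `prf1` are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in one BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues_span = tag.startswith("I-") and label == tag[2:]
        if not continues_span:
            if label is not None:
                spans.add((label, start, i))  # end index is exclusive
            if tag == "O":
                start, label = None, None
            else:
                # B-X, or a stray I-X, opens a new span (lenient handling)
                start, label = i, tag[2:]
    return spans


def prf1(gold, pred):
    """Micro-averaged entity-level precision, recall, and F1 over sentences."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(idx, *s) for s in extract_spans(g)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    tp = len(gold_spans & pred_spans)  # exact match on label and boundaries
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the second entity has the right boundaries but the wrong label.
gold = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]
pred = [["B-PER", "I-PER", "O", "O", "B-ORG", "O"]]
print(prf1(gold, pred))
```

Because a span only counts when both its boundaries and its label match, a mislabeled entity costs precision and recall at once, which is why entity-level scores tend to be lower than token-level accuracy.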
Example Output
[
{
"entity_group": "PER",
"word": "Emin Bey",
"start": 0,
"end": 8,
"score": 0.998
},
{
"entity_group": "LOC",
"word": "Tepebaşı",
"start": 35,
"end": 43,
"score": 0.997
}
]
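The `start` and `end` fields are character offsets into the input string, with `end` exclusive, so they can be used to slice or mark up the original text. The sketch below wraps each span in a bracketed label. The helpers `annotate` and `span` are hypothetical, and the offsets are computed with `str.index` instead of being copied from model output, so the entity list here is a hand-built stand-in.

```python
def annotate(text, entities):
    """Wrap each entity span in [LABEL ...] markers using its character offsets."""
    out, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        out.append(text[cursor:ent["start"]])  # untouched text before the span
        out.append(f'[{ent["entity_group"]} {text[ent["start"]:ent["end"]]}]')
        cursor = ent["end"]
    out.append(text[cursor:])  # remainder after the last span
    return "".join(out)


text = "Emin Bey’in kuklaları bir haftadır Tepebaşı’nda oynuyor."


def span(label, surface):
    """Build a pipeline-style dict, locating the surface form with str.index."""
    start = text.index(surface)
    return {"entity_group": label, "word": surface, "start": start, "end": start + len(surface)}


# Hand-built stand-in for pipeline output, not produced by the model.
entities = [span("PER", "Emin Bey"), span("LOC", "Tepebaşı")]
print(annotate(text, entities))
```

Slicing by offsets instead of searching for the `word` field avoids mismatches when the tokenizer normalizes casing or spacing in the reported word.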
Citation
Please cite the model as:
@software{karagoz_ottoman_ner_latin_2025,
author = {Karagöz, Fatih Burak},
title = {Ottoman-NER Latin: A Named Entity Recognition Model for Transliterated Ottoman Turkish},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/fatihburakkaragoz/ottoman-ner-latin},
note = {Model version 0.2.0. Developed as part of the Ottoman-NLP initiative.}
}
Citation for research
@article{ozates2025building,
title={Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models},
author={Özateş, Şaziye Betül and Tıraş, Tuğba Eser and Adak, Elif Esra and Doğan, Berkay and Karagöz, Fatih Burak and Genç, Elif Esma and others},
journal={arXiv preprint arXiv:2501.04828},
year={2025}
}
@inproceedings{karagoz2024towards,
title={Towards a Clean Text Corpus for Ottoman Turkish},
author={Karagöz, Fatih Burak and Doğan, Berkay and Özateş, Şaziye Betül},
booktitle={Proceedings of the First Workshop on Natural Language Processing for Turkic Languages},
year={2024}
}
Contact
- Author: Fatih Burak Karagöz
- Email: [email protected]
- GitHub: fbkaragoz
- Website: https://www.karagoz.io