Ottoman-NER Latin (fatihburakkaragoz/ottoman-ner-latin)

This model performs Named Entity Recognition (NER) on historical Ottoman Turkish texts transliterated into Latin script. It identifies three entity types: persons (PER), locations (LOC), and organizations (ORG). The model is fine-tuned from dbmdz/bert-base-turkish-cased on a custom CoNLL-formatted dataset annotated specifically for Ottoman Turkish in Latin script.


Part of the Ottoman-NLP Project

While this specific model (ottoman-ner-latin) was developed independently, it builds upon the broader Ottoman-NLP research direction initiated within Boğaziçi University - BUCOLIN Lab under the guidance of Prof. Dr. Şaziye Betül Özateş.

Explore more from the lab at: huggingface.co/BUCOLIN
Main project repository: github.com/Ottoman-NLP


Model Details

  • Model Type: Token Classification (NER)
  • Base Model: dbmdz/bert-base-turkish-cased
  • Language: Ottoman Turkish (Latin transliteration)
  • Training Dataset: Custom manually annotated dataset
  • Fine-tuned by: Fatih Burak Karagöz
  • License: MIT
  • Usage: Historical Turkish NLP, research, digital humanities
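The training data is described above as CoNLL-formatted. Its exact column layout is not documented here, so the following is only a minimal sketch that assumes the common convention of one token per line, the tag in the last whitespace-separated column, and a blank line between sentences. The function name `read_conll`, the BIO prefixes, and the sample annotation are hypothetical and are not taken from the actual dataset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences (assumed layout: token ... tag)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical annotation of the example sentence, for illustration only.
sample = """\
Emin B-PER
Bey’in I-PER
kuklaları O
bir O
haftadır O
Tepebaşı’nda B-LOC
oynuyor O
. O
"""

sentences = read_conll(sample.splitlines())
print(sentences[0][:2])
```

If the real files carry extra columns or use a different tagging scheme, only the column indexing inside the loop would need to change.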

Usage

Transformers pipeline

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_id = "fatihburakkaragoz/ottoman-ner-latin"
model = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# aggregation_strategy="simple" merges word pieces into whole entity spans
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Emin Bey’in kuklaları bir haftadır Tepebaşı’nda oynuyor."
entities = ner_pipeline(text)
print(entities)
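With an aggregation strategy set, the pipeline returns a list of dictionaries with keys such as `entity_group`, `word`, and `score`. A common next step is to drop low-confidence predictions and group the rest by type. The sketch below shows one way to do that. The helper name `group_by_label` and the threshold value are assumptions for illustration, and the sample list is a hand-written stand-in so the snippet runs without downloading the model.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group entity surface forms by label, skipping low-confidence spans."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Stand-in for the value returned by ner_pipeline(text).
sample_entities = [
    {"entity_group": "PER", "word": "Emin Bey", "score": 0.998},
    {"entity_group": "LOC", "word": "Tepebaşı", "score": 0.997},
]

grouped = group_by_label(sample_entities, min_score=0.9)
print(grouped)
```

In real use, `sample_entities` would be replaced by the `entities` list from the pipeline call; a suitable threshold depends on the corpus and is best chosen by inspecting a few outputs.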

Training & Evaluation

  • Training Script: scripts/train_latin_ner.py
  • Labels: PER (Person), LOC (Location), ORG (Organization), O
  • Metrics (seqeval):
    • Precision: e.g., 89.4%
    • Recall: e.g., 87.1%
    • F1 Score: e.g., 88.2%
  • Epochs: 5
  • Batch Size: 8
  • Hardware: RTX 4090, CUDA 12.8
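The metrics above are entity-level scores: a prediction counts only if both the span boundaries and the label match the gold annotation exactly. To make that concrete, here is a small pure-Python sketch of micro-averaged precision, recall, and F1 over BIO tag sequences. It is not the seqeval implementation and not the training script; the BIO scheme, the function names `bio_to_spans` and `entity_prf1`, and the toy tag sequences are all assumptions for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (label, start_token, end_token_exclusive)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            # A stray I- tag is treated leniently as the start of a span.
            start, label = i, tag[2:]
    return spans


def entity_prf1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1."""
    gold_spans, pred_spans = set(), set()
    for sent_id, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(sent_id,) + s for s in bio_to_spans(g)}
        pred_spans |= {(sent_id,) + s for s in bio_to_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the prediction finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]]
pred = [["B-PER", "I-PER", "O", "O", "O", "O", "O"]]
precision, recall, f1 = entity_prf1(gold, pred)
print(precision, recall, round(f1, 3))
```

In this toy case the single predicted entity is correct, so precision is perfect, while recall is halved by the missed location.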


Example Output

[
  {
    "entity_group": "PER",
    "word": "Emin Bey",
    "start": 0,
    "end": 8,
    "score": 0.998
  },
  {
    "entity_group": "LOC",
    "word": "Tepebaşı",
    "start": 35,
    "end": 43,
    "score": 0.997
  }
]
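The `start` and `end` fields are character offsets into the input string, with `end` exclusive, so `text[start:end]` recovers the entity surface form. The sketch below uses them to mark entities inline. The helper names `mark_entities` and `make_span` and the bracket notation are assumptions chosen for illustration, and the offsets are recomputed from the text with `str.index` rather than taken from a model run.

```python
from typing import List


def mark_entities(text: str, entities: List[dict]) -> str:
    """Wrap each entity span in the text as [surface](LABEL)."""
    pieces, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        surface = text[ent["start"]:ent["end"]]
        pieces.append(text[cursor:ent["start"]])  # text before the entity
        pieces.append("[" + surface + "](" + ent["entity_group"] + ")")
        cursor = ent["end"]
    pieces.append(text[cursor:])  # remaining tail
    return "".join(pieces)


text = "Emin Bey’in kuklaları bir haftadır Tepebaşı’nda oynuyor."


def make_span(word: str, label: str) -> dict:
    """Build a pipeline-like entity dict by locating the word in the text."""
    start = text.index(word)
    return {"entity_group": label, "word": word, "start": start, "end": start + len(word)}


entities = [make_span("Emin Bey", "PER"), make_span("Tepebaşı", "LOC")]
marked = mark_entities(text, entities)
print(marked)
```

This assumes the spans do not overlap, which holds for the aggregated pipeline output shown here.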

Citation

Please cite the model as:

@software{karagoz_ottoman_ner_latin_2025,
  author = {Karagöz, Fatih Burak},
  title = {Ottoman-NER Latin: A Named Entity Recognition Model for Transliterated Ottoman Turkish},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/fatihburakkaragoz/ottoman-ner-latin},
  note = {Model version 0.2.0. Developed as part of the Ottoman-NLP initiative.}
}

Citation for research

@article{ozates2025building,
  title={Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models},
  author={Özateş, Şaziye Betül and Tıraş, Tuğba Eser and Adak, Elif Esra and Doğan, Berkay and Karagöz, Fatih Burak and Genç, Elif Esma and others},
  journal={arXiv preprint arXiv:2501.04828},
  year={2025}
}

@inproceedings{karagoz2024towards,
  title={Towards a Clean Text Corpus for Ottoman Turkish},
  author={Karagöz, Fatih Burak and Doğan, Berkay and Özateş, Şaziye Betül},
  booktitle={Proceedings of the First Workshop on Natural Language Processing for Turkic Languages},
  year={2024}
}

Contact
