Ottoman-NER Latin (`fatihburakkaragoz/ottoman-ner-latin`)

This model performs named entity recognition (NER) on Ottoman Turkish texts transliterated into Latin script. It identifies persons (PER), locations (LOC), and organizations (ORG) in historical texts. The model is fine-tuned from dbmdz/bert-base-turkish-cased on a custom CoNLL-formatted dataset annotated specifically for Ottoman Turkish in Latin script.

Part of the Ottoman-NLP Project

This specific model (ottoman-ner-latin) was developed independently, but it builds on the broader Ottoman-NLP research direction initiated at the BUCOLIN Lab, Boğaziçi University, under the guidance of Prof. Dr. Şaziye Betül Özateş.

Explore more from the lab at: huggingface.co/BUCOLIN
Main project repository: github.com/Ottoman-NLP

Model Details

Model Type: Token Classification (NER)
Base Model: dbmdz/bert-base-turkish-cased
Language: Ottoman Turkish (Latin transliteration)
Training Dataset: Custom manually annotated dataset
Fine-tuned by: Fatih Burak Karagöz
License: MIT
Intended Use: Historical Turkish NLP, research, digital humanities
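
The card describes the training data only as CoNLL-formatted, so the exact column layout is not stated. As a rough sketch, assuming a two-column layout (token first, tag last), blank lines between sentences, and BIO-prefixed tags, such a file could be read as follows. The function name `read_conll` and the sample tags are hypothetical and are not taken from the actual dataset.

```python
def read_conll(lines):
    """Yield (tokens, tags) for each sentence in CoNLL-style lines.

    Assumes whitespace-separated columns with the token first and the
    tag last, and a blank line between sentences (an assumption; the
    model card does not specify the layout).
    """
    tokens, tags = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        # Flush the last sentence when the input has no trailing blank line.
        yield tokens, tags


# Hypothetical annotation of the card's example sentence.
sample = """\
Emin B-PER
Bey’in I-PER
kuklaları O
bir O
haftadır O
Tepebaşı’nda B-LOC
oynuyor O
. O
"""

sentences = list(read_conll(sample.splitlines()))
for tokens, tags in sentences:
    print(list(zip(tokens, tags)))
```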

Usage

Transformers pipeline

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("fatihburakkaragoz/ottoman-ner-latin")
tokenizer = AutoTokenizer.from_pretrained("fatihburakkaragoz/ottoman-ner-latin")

# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Emin Bey’in kuklaları bir haftadır Tepebaşı’nda oynuyor."
entities = ner_pipeline(text)
print(entities)
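
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries that include `entity_group`, `word`, and `score` keys. A small helper can group those results by label and drop low-confidence spans. The sketch below is one possible way to do this; `group_entities` and `min_score` are hypothetical names, and the sample list is hard-coded from the values shown under Example Output so that the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group aggregated NER results by label, keeping input order."""
    grouped = defaultdict(list)
    for ent in entities:
        # float() also accepts the NumPy scalars the pipeline returns.
        if float(ent["score"]) >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Stand-in for `ner_pipeline(text)`; only the keys used here are included.
sample = [
    {"entity_group": "PER", "word": "Emin Bey", "score": 0.998},
    {"entity_group": "LOC", "word": "Tepebaşı", "score": 0.997},
]

print(group_entities(sample, min_score=0.9))
```

In practice, `sample` would be replaced by the pipeline's return value, and the threshold would be tuned to the precision the application needs.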

Training & Evaluation

- Training Script: scripts/train_latin_ner.py
- Labels: PER (Person), LOC (Location), ORG (Organization), O (outside any entity)
- Metrics (seqeval):
  - Precision: e.g., 89.4%
  - Recall: e.g., 87.1%
  - F1 Score: e.g., 88.2%
- Epochs: 5
- Batch Size: 8
- Hardware: RTX 4090, CUDA 12.8
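
seqeval scores predictions at the entity level: a predicted span counts as correct only if both its label and its boundaries match the gold span. The pure-Python sketch below illustrates that idea. It is not the seqeval implementation and not the project's training script. It assumes BIO-prefixed tags (`B-PER`, `I-PER`, ...), which the card does not confirm, and as a simplification it ignores a stray `I-` tag that does not continue a span. The function names and the tag sequences are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in one BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if prefix == "B":
            start, label = i, name
    return spans


def entity_prf(gold_sents, pred_sents):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tags: the prediction misses the location.
gold = [["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]]
pred = [["B-PER", "I-PER", "O", "O", "O", "O", "O"]]

precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```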

Example Output

[
  {
    "entity_group": "PER",
    "word": "Emin Bey",
    "start": 0,
    "end": 8,
    "score": 0.998
  },
  {
    "entity_group": "LOC",
    "word": "Tepebaşı",
    "start": 35,
    "end": 43,
    "score": 0.997
  }
]
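
The `start` and `end` fields are character offsets into the input string, so `text[start:end]` should reproduce `word`. One use for them is marking entities inline in the original text. The sketch below shows one way to do that. To keep it independent of any model download, it derives the offsets with `str.find` rather than calling the pipeline. The helpers `locate` and `annotate` are hypothetical names, not part of Transformers.

```python
def locate(text, word, label):
    """Build a pipeline-style entity dict by searching for `word` in `text`."""
    start = text.find(word)
    if start < 0:
        raise ValueError(f"{word!r} not found in text")
    return {"entity_group": label, "word": word,
            "start": start, "end": start + len(word)}


def annotate(text, entities):
    """Wrap each entity span as [span](LABEL) using its character offsets."""
    out = text
    # Work right to left so earlier offsets stay valid after each insertion.
    for ent in sorted(entities, key=lambda item: item["start"], reverse=True):
        start, end = ent["start"], ent["end"]
        out = out[:start] + "[" + out[start:end] + "](" + ent["entity_group"] + ")" + out[end:]
    return out


text = "Emin Bey’in kuklaları bir haftadır Tepebaşı’nda oynuyor."
entities = [locate(text, "Emin Bey", "PER"), locate(text, "Tepebaşı", "LOC")]

annotated = annotate(text, entities)
print(annotated)
```

With real pipeline output, `entities` can be passed to `annotate` directly, since it reads only the `start`, `end`, and `entity_group` keys.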

Citation

Please cite the model as:

@software{karagoz_ottoman_ner_latin_2025,
  author = {Karagöz, Fatih Burak},
  title = {Ottoman-NER Latin: A Named Entity Recognition Model for Transliterated Ottoman Turkish},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/fatihburakkaragoz/ottoman-ner-latin},
  note = {Model version 0.2.0. Developed as part of the Ottoman-NLP initiative.}
}

Citation for research

@article{ozates2025building,
  title={Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models},
  author={Özateş, Şaziye Betül and Tıraş, Tuğba Eser and Adak, Elif Esra and Doğan, Berkay and Karagöz, Fatih Burak and Genç, Elif Esma and others},
  journal={arXiv preprint arXiv:2501.04828},
  year={2025}
}

@inproceedings{karagoz2024towards,
  title={Towards a Clean Text Corpus for Ottoman Turkish},
  author={Karagöz, Fatih Burak and Doğan, Berkay and Özateş, Şaziye Betül},
  booktitle={Proceedings of the First Workshop on Natural Language Processing for Turkic Languages},
  year={2024}
}

Contact

Author: Fatih Burak Karagöz
Email: [email protected]
GitHub: fbkaragoz
Website: https://www.karagoz.io
