BerTurk Ottoman Full DAPT

A domain-adaptive continuation of dbmdz/bert-base-turkish-128k-cased, further pre-trained with masked language modeling on 800 K modern-Latin transliterations of Ottoman-Turkish sentences (≈ 14 M tokens) from the OTC Corpus (Özateş et al., 2025). This checkpoint is intended as a drop-in encoder for downstream NER tasks.


Model Details

Property            Value
------------------  ----------------------------------
Base model          dbmdz/bert-base-turkish-128k-cased
Domain data         BUCOLIN/OTC-Corpus
Pre-training task   Masked Language Modeling (MLM)
Epochs              4
Sequence length     128 tokens (chunked)
Batch size          16 (per device)
Learning rate       3 × 10⁻⁵
Warmup steps        500
Weight decay        0.01
Mixed precision     fp16
Checkpoint          Full model weights (≈ 184 M params)
Vocabulary          Same as base model

Training Data

  • Corpus: BUCOLIN/OTC-Corpus
    • 800 K modern-Latin transliterations of Ottoman-Turkish text
    • Split 90 %/10 % into train/validation for this run; tokenization and 128-token chunking follow the recipe sketched below
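A minimal data-preparation sketch, assuming the corpus exposes a "text" column and a single train split (both assumptions about the dataset schema), using the standard tokenize-then-chunk recipe for 128-token blocks:

from datasets import load_dataset
from transformers import AutoTokenizer

# Base tokenizer is reused unchanged (the vocabulary is the same as the base model)
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-cased")
raw = load_dataset("BUCOLIN/OTC-Corpus")

block_size = 128  # sequence length used for DAPT

def tokenize(batch):
    # "text" is an assumed column name in the OTC corpus
    return tokenizer(batch["text"], return_special_tokens_mask=True)

def group_texts(examples):
    # Concatenate all token sequences, then slice into fixed 128-token blocks
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }

tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)
chunked = tokenized.map(group_texts, batched=True)
split = chunked["train"].train_test_split(test_size=0.1)  # 90 % / 10 % split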

Training


from transformers import TrainingArguments

# DAPT hyperparameters
args = TrainingArguments(
    output_dir="BerTurk_Ottoman_Full_DAPT",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    learning_rate=3e-5,
    eval_strategy="epoch",       # evaluate once per epoch
    save_strategy="epoch",       # checkpoint once per epoch
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,                   # mixed-precision training
    logging_steps=100,
    save_steps=500,              # ignored with save_strategy="epoch"
    eval_steps=500,              # ignored with eval_strategy="epoch"
    save_total_limit=2,
    load_best_model_at_end=True,
)
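
Wiring these arguments into a Trainer follows the standard MLM recipe; a minimal sketch, reusing tokenizer, args, and the split dataset from the sketches above, and assuming the library-default 15 % mask probability (the card does not state it):

from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling, Trainer

# Start from the base checkpoint; DAPT continues its pre-training on domain text
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-turkish-128k-cased")

# Dynamic masking applied per batch (15 % is the library default, assumed here)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=collator,
)
trainer.train()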

Hardware & Runtime

  • Hardware: Google Colab Pro (T4 GPU, high VRAM)
  • Batch size: 128
  • Final validation loss: 2.2306
  • Total DAPT time: ≈ 3 hours for 4 epochs

Example Use


from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")
model = AutoModelForMaskedLM.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")

# Quick fill-mask sanity check on an Ottoman-Turkish sentence
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
res = nlp("Devlet-i Aliyye-i Osmaniyye’nin [MASK] için tedâbîr-i mühimme ittikhāz olunmalıdır.")
print(res)
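
Since the checkpoint is meant as a drop-in encoder for NER, it loads directly under a token-classification head; a minimal sketch (the label set and count are illustrative, not part of this card):

from transformers import AutoModelForTokenClassification

# The MLM head is discarded; a randomly initialized classification head is added
ner_model = AutoModelForTokenClassification.from_pretrained(
    "cihanunlu/BerTurk_Ottoman_Full_DAPT",
    num_labels=9,  # illustrative: e.g. BIO tags for PER/LOC/ORG/MISC plus O
)
# Fine-tune with Trainer on a token-level NER dataset as usual.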