Model Card for BerTurk Ottoman Full DAPT
A domain-adaptive pre-training (DAPT) continuation of dbmdz/bert-base-turkish-128k-cased, further pre-trained with masked language modeling on 800 K modern-Latin transliterated Ottoman-Turkish sentences (≈ 14 M tokens) from the OTC Corpus (Özateş et al., 2025). This checkpoint is intended as a drop-in encoder for downstream NER tasks on Ottoman-Turkish text.
Model Details
| Property | Value |
|---|---|
| Base | dbmdz/bert-base-turkish-128k-cased |
| Domain data | BUCOLIN/OTC-Corpus |
| Pre-training task | Masked Language Modeling (MLM) |
| Epochs | 4 |
| Sequence length | 128 tokens (chunked) |
| Batch size | 16 (per device) |
| Learning rate | 3 × 10⁻⁵ |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Mixed precision | fp16 |
| Checkpoint size | full weights, stored in fp16 |
| Vocabulary | same as the base model |
Training Data
- Corpus: BUCOLIN/OTC-Corpus
- 800 K modern-Latin transliterations of Ottoman-Turkish text
- Split into train/validation (90 % / 10 %) for the DAPT run (see the preprocessing sketch below)
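Before DAPT, the raw sentences have to be tokenized and packed into the 128-token blocks listed in the table above. A minimal preprocessing sketch, assuming the corpus is loaded with the datasets library and exposes its sentences in a "text" column (the split and column names are assumptions, not confirmed by the corpus card):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: the corpus is available on the Hub with a "text" column.
raw = load_dataset("BUCOLIN/OTC-Corpus", split="train")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-cased")

BLOCK_SIZE = 128  # sequence length used for DAPT

def tokenize(batch):
    return tokenizer(batch["text"], return_special_tokens_mask=True)

def group_texts(batch):
    # Concatenate all token ids, then split them into fixed 128-token blocks.
    concatenated = {k: sum(batch[k], []) for k in batch}
    total = (len(concatenated["input_ids"]) // BLOCK_SIZE) * BLOCK_SIZE
    return {
        k: [v[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
        for k, v in concatenated.items()
    }

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)
dataset = lm_dataset.train_test_split(test_size=0.1, seed=42)  # 90/10 split
```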
Training
```python
from transformers import TrainingArguments

# Training arguments for the MLM domain-adaptive pre-training run
args = TrainingArguments(
    output_dir="BerTurk_Ottoman_Full_DAPT",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    learning_rate=3e-5,
    eval_strategy="epoch",          # evaluate at the end of every epoch
    save_strategy="epoch",          # checkpoint at the end of every epoch
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,                      # mixed-precision training
    logging_steps=100,
    save_steps=500,                 # ignored with save_strategy="epoch"
    eval_steps=500,                 # ignored with eval_strategy="epoch"
    save_total_limit=2,             # keep only the two most recent checkpoints
    load_best_model_at_end=True,    # reload the best (lowest-loss) checkpoint after training
)
```
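These arguments are passed to a standard MLM Trainer. A minimal sketch of the remaining wiring, assuming the chunked dataset produced in the Training Data section and the library-default 15 % masking probability (the masking rate is not stated in this card):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-cased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-turkish-128k-cased")

# Dynamic masking: 15 % of tokens are masked per batch (library default; assumed here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=args,                      # TrainingArguments defined above
    train_dataset=dataset["train"], # 90 % split of the chunked corpus
    eval_dataset=dataset["test"],   # 10 % validation split
    data_collator=collator,
)
trainer.train()
trainer.save_model("BerTurk_Ottoman_Full_DAPT")
```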
Hardware & Training
- Hardware: Google Colab Pro (T4 GPU, high VRAM).
- Batch size: 128
- Final validation loss: 2.2306
- Total DAPT time: ~ 3 hours for 4 epochs
Example Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the DAPT checkpoint and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")
model = AutoModelForMaskedLM.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")

# Fill-mask sanity check on an Ottoman-Turkish sentence
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
res = nlp("Devlet-i Aliyye-i Osmaniyye’nin [MASK] için tedâbîr-i mühimme ittikhāz olunmalıdır.")
print(res)
```
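Because the checkpoint is meant as a drop-in encoder for downstream NER, it can also be loaded under a token-classification head. A minimal sketch, assuming a hypothetical label set; the actual labels come from whatever NER dataset you fine-tune on:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical label set for illustration only.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("cihanunlu/BerTurk_Ottoman_Full_DAPT")
model = AutoModelForTokenClassification.from_pretrained(
    "cihanunlu/BerTurk_Ottoman_Full_DAPT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# Fine-tune with Trainer on a token-classification dataset as usual.
```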