TabiBERT

Table of Contents

  1. Model Summary
  2. Usage
  3. Evaluation
  4. Limitations
  5. Training
  6. License
  7. Citation

Model Summary

TabiBERT is a modernized encoder-only Transformer model (BERT-style) based on the ModernBERT architecture.
It was pre-trained on roughly 1 trillion tokens of diverse data in two phases:

  • The first 850B tokens came from a union corpus of Turkish, English, code (with English commentary), and math problems, with about 13% non-Turkish tokens.
  • The final 150B tokens came from a Turkish-only corpus of about 75B tokens (repeated over multiple epochs) to specialize the model for Turkish.

TabiBERT inherits ModernBERT’s architectural improvements (see the config sketch after this list), such as:

  • Rotary Positional Embeddings (RoPE) for long-context support.
  • Local-Global Alternating Attention for efficiency on long inputs.
  • Unpadding and Flash Attention for efficient inference.
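
These design choices are reflected in the model configuration. Below is a minimal inspection sketch, assuming the checkpoint ships with the standard ModernBERT config class in transformers; the attribute names follow ModernBertConfig and are not TabiBERT-specific guarantees:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")

# RoPE settings for the global and local attention layers.
print("global RoPE theta:", getattr(config, "global_rope_theta", None))
print("local RoPE theta:", getattr(config, "local_rope_theta", None))

# Local-global alternating attention: every n-th layer attends globally,
# the rest use a sliding local-attention window.
print("global attention every n layers:", getattr(config, "global_attn_every_n_layers", None))
print("local attention window:", getattr(config, "local_attention", None))

# Maximum supported sequence length (8,192 tokens after context extension).
print("max position embeddings:", getattr(config, "max_position_embeddings", None))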

These improvements make TabiBERT particularly suitable for:

  • Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
  • Multilingual (Turkish-English) text understanding, though English exposure was limited.
  • Some code and math understanding due to mixed pre-training.
  • Long-context understanding such as document classification, retrieval, and semantic search (see the embedding sketch below).
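
For retrieval and semantic search, a common pattern is to mean-pool the encoder's hidden states into sentence embeddings. The sketch below is illustrative only; the mean-pooling recipe is our assumption, not an officially recommended embedding setup:

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "İstanbul Boğazı Karadeniz ve Marmara Denizi'ni birbirine bağlar.",  # "The Bosphorus connects the Black Sea and the Sea of Marmara."
    "Boğaz, iki denizi birbirine bağlayan bir su yoludur.",  # "The strait is a waterway connecting two seas."
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity as a simple relevance score for semantic search.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print("cosine similarity:", sim.item())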

TabiBERT is built by Tabilab in collaboration with VNGRS.


Usage

You can use TabiBERT directly with the transformers library (v4.48.0+):

pip install -U "transformers>=4.48.0"

Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.

⚠️ If your GPU supports it, we recommend using TabiBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

pip install flash-attn
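
With flash-attn installed, the attention implementation can be selected when loading the model. A minimal sketch; the bfloat16 dtype and the .to("cuda") call are our choices for illustration:

import torch
from transformers import AutoModelForMaskedLM

# Flash Attention 2 requires a supported GPU and the flash-attn package.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")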

Example usage with AutoModelForMaskedLM:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# "The Bosphorus connects the [MASK] and the Sea of Marmara."
text = "İstanbul Boğazı [MASK] ve Marmara Denizi'ni birbirine bağlar."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Find the [MASK] position and take the highest-scoring token at that position.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token:  Karadeniz

Example with pipeline:

from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")

print(pipe("[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."))
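
For the downstream tasks listed in the summary (e.g., text classification), the checkpoint can also be loaded with a task head through the standard transformers auto classes. A minimal sketch; the head below is freshly initialized and still needs fine-tuning on labeled data, and num_labels=2 is only an example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The classification head is newly initialized and must be fine-tuned before use.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Bu film harikaydı!", return_tensors="pt")  # "This movie was great!"
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])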

Evaluation

Evaluations are in progress.

Limitations

  • TabiBERT was trained mainly on Turkish, with some English, code, and math exposure. Its performance on English may be limited, and it may underperform on other languages.
  • As with any large-scale model, it may inherit biases from training data.
  • While the model can handle sequences of up to 8k tokens, inference on very long inputs may be slower.
  • The model is still under evaluation; validate results before deploying it in critical applications.

Training

  • Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
  • Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
  • Data: 1 trillion tokens in total:
    • First 850B tokens: union corpus (Turkish, English, code with English commentary, math in English; ~13% non-Turkish).
    • Final 150B tokens: Turkish-only corpus of ~75B tokens, repeated over multiple epochs.
  • Optimizer: StableAdamW with a trapezoidal LR schedule and 1-sqrt decay (see the sketch after this list).
  • Hardware: Trained on 8x H100 GPUs.
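
For reference, a trapezoidal schedule with 1-sqrt decay can be sketched as below; the step counts and peak learning rate are illustrative placeholders, not the actual training hyperparameters:

import math

def trapezoidal_lr(step, max_lr, warmup_steps, decay_start, total_steps):
    """Linear warmup, constant plateau, then 1-sqrt decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step < decay_start:
        return max_lr
    progress = (step - decay_start) / (total_steps - decay_start)
    return max_lr * (1.0 - math.sqrt(progress))

# Example with hypothetical values: decay over the final 10% of training.
print(trapezoidal_lr(step=95_000, max_lr=8e-4, warmup_steps=2_000,
                     decay_start=90_000, total_steps=100_000))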

License

Released under the Apache 2.0 license.

Citation

Citation is in progress.
