TabiBERT
Model Summary
TabiBERT is a modernized encoder-only Transformer model (BERT-style) based on the ModernBERT architecture.
It has been pre-trained on roughly 1 trillion tokens of diverse data in two phases:
- The first 850B tokens came from a union corpus of Turkish, English, code (with English commentary), and math problems, of which about 13% were non-Turkish.
- The final 150B tokens came from a Turkish-only corpus of about 75B tokens, reused over multiple epochs to specialize the model for Turkish.
TabiBERT inherits ModernBERT’s architectural improvements, such as:
- Rotary Positional Embeddings (RoPE) for long-context support.
- Local-Global Alternating Attention for efficiency on long inputs.
- Unpadding and Flash Attention for efficient inference.
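These settings are exposed on the model configuration. A minimal sketch for inspecting them; the attribute names follow the ModernBertConfig class in transformers and are assumptions about this checkpoint, so any that are missing simply print as None:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")

# Attribute names below follow ModernBertConfig in transformers;
# any the checkpoint does not define will print as None.
for name in (
    "max_position_embeddings",     # maximum context length
    "global_attn_every_n_layers",  # how often a global-attention layer appears
    "local_attention",             # sliding-window size of local-attention layers
    "global_rope_theta",           # RoPE base frequency for global layers
    "local_rope_theta",            # RoPE base frequency for local layers
):
    print(name, "=", getattr(config, name, None))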
These architectural choices make TabiBERT particularly suitable for:
- Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
- Multilingual text understanding (Turkish-English; limited English exposure).
- Some code and math understanding due to mixed pre-training.
- Long-context understanding such as document classification, retrieval, and semantic search.
TabiBERT was developed by Tabilab in collaboration with VNGRS.
Usage
You can use TabiBERT directly with the transformers library (v4.48.0+):
pip install -U "transformers>=4.48.0"
Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.
⚠️ If your GPU supports it, we recommend running TabiBERT with Flash Attention 2 for the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:
pip install flash-attn
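With flash-attn installed, you can request the Flash Attention 2 backend when loading the model. A minimal sketch; the bfloat16 dtype and the CUDA device are assumptions about your hardware, not requirements of TabiBERT itself:

import torch
from transformers import AutoModelForMaskedLM

# Flash Attention 2 needs a supported GPU and half-precision weights.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")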
Example usage with AutoModelForMaskedLM:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# "The Bosphorus connects [MASK] and the Sea of Marmara."
text = "İstanbul Boğazı [MASK] ve Marmara Denizi'ni birbirine bağlar."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Find the [MASK] position and take the highest-scoring token there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Karadeniz ("Black Sea")
Example with the fill-mask pipeline:
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
# "The largest planet in the [MASK] System is Jupiter."
print(pipe("[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."))
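Beyond fill-mask, the encoder can be fine-tuned for the downstream Turkish tasks listed in the Model Summary. A minimal sketch of attaching a sequence-classification head; the two-label sentiment setup, the example sentences, and the labels are illustrative placeholders, and the head is randomly initialized, so it needs fine-tuning on real data before use:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels=2 is a placeholder for an illustrative binary sentiment task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# "This film was great." / "The service was very bad."
batch = tokenizer(
    ["Bu film harikaydı.", "Hizmet çok kötüydü."],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])  # toy labels: 1 = positive, 0 = negative

outputs = model(**batch, labels=labels)
print(outputs.loss)    # cross-entropy loss for this toy batch
print(outputs.logits)  # one score per label for each sentence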
Evaluation
Evaluations are in progress.
Limitations
- TabiBERT was trained mainly on Turkish, with some English, code, and math exposure. Its performance on English may be limited, and it may underperform on other languages.
- As with any large-scale model, it may inherit biases from training data.
- While the model can handle sequences of up to 8,192 tokens, inference on very long inputs may be slower (see the truncation sketch after this list).
- The model is still under evaluation; validate results before deploying it in critical applications.
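For the long-input case above, it helps to truncate explicitly at the context limit rather than rely on defaults. A minimal sketch; the placeholder document and the choice of AutoModel (bare encoder, no task head) are assumptions for illustration:

from transformers import AutoTokenizer, AutoModel

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

long_document = "Çok uzun bir Türkçe belge..."  # placeholder for "a very long Turkish document..."

# Truncate at the 8,192-token context limit; longer documents
# must be chunked before encoding.
inputs = tokenizer(long_document, truncation=True, max_length=8192, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)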
Training
- Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
- Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
- Data: roughly 1 trillion tokens in two phases:
  - First 850B tokens: union corpus (Turkish, English, code with English commentary, math in English; ~13% non-Turkish).
  - Final 150B tokens: Turkish-only 75B-token corpus, repeated over multiple epochs.
- Optimizer: StableAdamW with a trapezoidal LR schedule and 1-sqrt decay (sketched after this list).
- Hardware: Trained on 8x H100 GPUs.
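The trapezoidal (warmup-stable-decay) schedule with a 1-sqrt decay tail mentioned above can be written down compactly. A minimal illustrative sketch; the function and its parameters are hypothetical, and the actual TabiBERT hyperparameter values are not published here:

def trapezoidal_lr(step, peak_lr, warmup_steps, decay_steps, total_steps):
    """Trapezoidal LR schedule: linear warmup, constant plateau, 1-sqrt decay."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Constant plateau at the peak learning rate.
        return peak_lr
    # 1-sqrt decay: the rate falls off as 1 - sqrt(progress through the decay phase).
    progress = (step - decay_start) / decay_steps
    return peak_lr * (1.0 - progress ** 0.5)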
License
Released under the Apache 2.0 license.
Citation
Citation is in progress.