TabiBERT
Model Summary
TabiBERT is a modernized encoder-only Transformer model (BERT-style) based on the ModernBERT architecture.
It has been pre-trained on roughly 1 trillion tokens of diverse data in two phases:
- The first 850B tokens came from a union corpus of Turkish, English, code (with English commentary), and math problems, of which about 13% were non-Turkish.
- The final 150B tokens came from a Turkish-only corpus of about 75B tokens, reused over multiple epochs to specialize the model for Turkish.
TabiBERT inherits ModernBERT’s architectural improvements, such as:
- Rotary Positional Embeddings (RoPE) for long-context support.
- Local-Global Alternating Attention for efficiency on long inputs.
- Unpadding and Flash Attention for efficient inference.
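These settings are exposed on the model configuration. A minimal sketch for inspecting them; the attribute names follow the ModernBertConfig class in transformers and are assumptions about this checkpoint, so any that are missing simply print as None:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")

# Attribute names below follow ModernBertConfig in transformers;
# any the checkpoint does not define will print as None.
for name in (
    "max_position_embeddings",     # maximum context length
    "global_attn_every_n_layers",  # how often a global-attention layer appears
    "local_attention",             # sliding-window size of local-attention layers
    "global_rope_theta",           # RoPE base frequency for global layers
    "local_rope_theta",            # RoPE base frequency for local layers
):
    print(name, "=", getattr(config, name, None))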
These architectural choices make TabiBERT particularly suitable for:
- Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
- Multilingual text understanding (Turkish-English; limited English exposure).
- Some code and math understanding due to mixed pre-training.
- Long-context understanding such as document classification, retrieval, and semantic search.
TabiBERT was developed by Tabilab in collaboration with VNGRS.
Usage
You can use TabiBERT directly with the transformers library (v4.48.0+):
pip install -U "transformers>=4.48.0"
Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.
⚠️ If your GPU supports it, we recommend running TabiBERT with Flash Attention 2 for the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:
pip install flash-attn
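With flash-attn installed, you can request the Flash Attention 2 backend when loading the model. A minimal sketch; the bfloat16 dtype and the CUDA device are assumptions about your hardware, not requirements of TabiBERT itself:

import torch
from transformers import AutoModelForMaskedLM

# Flash Attention 2 needs a supported GPU and half-precision weights.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")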
Example usage with AutoModelForMaskedLM:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# "The Bosphorus connects [MASK] and the Sea of Marmara."
text = "İstanbul Boğazı [MASK] ve Marmara Denizi'ni birbirine bağlar."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Find the [MASK] position and take the highest-scoring token there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Karadeniz ("Black Sea")
Example with the fill-mask pipeline:
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
# "The largest planet in the [MASK] System is Jupiter."
print(pipe("[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."))
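Beyond fill-mask, the encoder can be fine-tuned for the downstream Turkish tasks listed in the Model Summary. A minimal sketch of attaching a sequence-classification head; the two-label sentiment setup, the example sentences, and the labels are illustrative placeholders, and the head is randomly initialized, so it needs fine-tuning on real data before use:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels=2 is a placeholder for an illustrative binary sentiment task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# "This film was great." / "The service was very bad."
batch = tokenizer(
    ["Bu film harikaydı.", "Hizmet çok kötüydü."],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])  # toy labels: 1 = positive, 0 = negative

outputs = model(**batch, labels=labels)
print(outputs.loss)    # cross-entropy loss for this toy batch
print(outputs.logits)  # one score per label for each sentence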
Evaluation
Evaluations are in progress.
Limitations
- TabiBERT was trained mainly on Turkish, with some English, code, and math exposure. Its performance on English may be limited, and it may underperform on other languages.
- As with any large-scale model, it may inherit biases from training data.
- While the model can handle sequences of up to 8,192 tokens, inference on very long inputs may be slower (see the truncation sketch after this list).
- The model is still under evaluation; validate results before deploying it in critical applications.
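For the long-input case above, it helps to truncate explicitly at the context limit rather than rely on defaults. A minimal sketch; the placeholder document and the choice of AutoModel (bare encoder, no task head) are assumptions for illustration:

from transformers import AutoTokenizer, AutoModel

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

long_document = "Çok uzun bir Türkçe belge..."  # placeholder for "a very long Turkish document..."

# Truncate at the 8,192-token context limit; longer documents
# must be chunked before encoding.
inputs = tokenizer(long_document, truncation=True, max_length=8192, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)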
Training
- Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
- Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
- Data: roughly 1 trillion tokens in two phases:
  - First 850B tokens: union corpus (Turkish, English, code with English commentary, math in English; ~13% non-Turkish).
  - Final 150B tokens: Turkish-only 75B-token corpus, repeated over multiple epochs.
- Optimizer: StableAdamW with a trapezoidal LR schedule and 1-sqrt decay (sketched after this list).
- Hardware: Trained on 8x H100 GPUs.
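The trapezoidal (warmup-stable-decay) schedule with a 1-sqrt decay tail mentioned above can be written down compactly. A minimal illustrative sketch; the function and its parameters are hypothetical, and the actual TabiBERT hyperparameter values are not published here:

def trapezoidal_lr(step, peak_lr, warmup_steps, decay_steps, total_steps):
    """Trapezoidal LR schedule: linear warmup, constant plateau, 1-sqrt decay."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Constant plateau at the peak learning rate.
        return peak_lr
    # 1-sqrt decay: the rate falls off as 1 - sqrt(progress through the decay phase).
    progress = (step - decay_start) / decay_steps
    return peak_lr * (1.0 - progress ** 0.5)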
License
Released under the Apache 2.0 license.
Citation
Citation is in progress.