# KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿
## License & Metadata
- License: apache-2.0
- Languages: Kazakh (kk), Russian (ru), English (en)
- Base Model: google-bert/bert-base-uncased
- Pipeline Tag: fill-mask
- Tags: pytorch, safetensors
- Library: transformers
- Datasets:
  - amandyk/kazakh_wiki_articles
  - Eraly-ml/kk-cc-data
- Direct Use: ✅
- Widget Example: `KazBERT қазақ тілін [MASK] түсінеді.`
## Model Overview
KazBERT is a BERT-based model fine-tuned specifically for Kazakh using Masked Language Modeling (MLM). It is based on `bert-base-uncased` and uses a custom tokenizer trained on Kazakh text.
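As an illustration of the MLM objective, the sketch below uses 🤗 `DataCollatorForLanguageModeling` to build a masked training batch; the example sentence and the 15% masking probability (the library default) are illustrative assumptions, not details from this card:

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")

# Randomly replace 15% of tokens with [MASK] (the library default), as in BERT-style MLM.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("Қазақстан Орталық Азияда орналасқан.")])
print(batch["input_ids"])  # some tokens replaced by [MASK]
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```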
## Model Details
- Architecture: BERT
- Tokenizer: WordPiece trained on Kazakh (see the training sketch below)
- Training Data: Kazakh Wikipedia & Common Crawl
- Method: Masked Language Modeling (MLM)
- Paper: Erlanulu, Y. G. (2025). *KazBERT: A Custom BERT Model for the Kazakh Language.* Zenodo. 📄 Read the paper
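A minimal sketch of how a custom WordPiece tokenizer like this can be trained with the 🤗 `tokenizers` library; the corpus file name and vocabulary size are assumptions, not the model's actual training setup:

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical corpus file: raw Kazakh text, one sentence per line.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["kazakh_corpus.txt"],  # assumed file name
    vocab_size=30522,             # assumed; matches bert-base-uncased
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt
```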
## Files in Repository
- `config.json` – Model config
- `model.safetensors` – Model weights
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer config
- `special_tokens_map.json` – Special tokens
- `vocab.txt` – Vocabulary
## Training Configuration
- Epochs: 20
- Batch size: 16
- Learning rate: Default (not overridden; the 🤗 Trainer default is 5e-5)
- Weight decay: 0.01
- FP16 Training: Enabled
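A minimal sketch of how these hyperparameters map onto 🤗 `TrainingArguments`; the output directory is a placeholder, and the learning-rate comment reflects the Trainer default rather than a value reported on this card:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto the 🤗 Trainer API.
training_args = TrainingArguments(
    output_dir="kazbert-mlm",        # assumed output path
    num_train_epochs=20,             # Epochs: 20
    per_device_train_batch_size=16,  # Batch size: 16
    weight_decay=0.01,               # Weight decay: 0.01
    fp16=True,                       # FP16 Training: Enabled (requires a GPU)
    # learning_rate left at the Trainer default (5e-5)
)
```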
## Usage

Install 🤗 Transformers (`pip install transformers`), then load the model:

```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"

# Load the Kazakh WordPiece tokenizer and the MLM head from the Hub.
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```
### Example: Masked Token Prediction
```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.")
```
Output:

```json
[
  {"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."},
  {"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."},
  {"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."},
  {"score": 0.029, "token_str": "ерте", "sequence": "KazBERT қазақ тілін ерте түсінеді."},
  {"score": 0.026, "token_str": "жете", "sequence": "KazBERT қазақ тілін жете түсінеді."}
]
```
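For finer-grained control than the pipeline offers, the top predictions can also be read directly from the model's logits; a minimal sketch using the `model` and `tokenizer` loaded above:

```python
import torch

text = "KazBERT қазақ тілін [MASK] түсінеді."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```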
## Bias and Limitations

- Trained only on public Kazakh Wikipedia and Common Crawl text
- May not capture informal speech or regional dialects
- May underperform on deeply contextual or rare words
- May reflect cultural or social biases present in the training data
## License

This model is released under the Apache 2.0 License.
## Citation

```bibtex
@misc{eraly_gainulla_2025,
  author    = {Eraly Gainulla},
  title     = {KazBERT (Revision 15240d4)},
  year      = 2025,
  url       = {https://huggingface.co/Eraly-ml/KazBERT},
  doi       = {10.57967/hf/5271},
  publisher = {Hugging Face}
}
```