---
license: apache-2.0
language:
- kk
- ru
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: fill-mask
tags:
- pytorch
- safetensors
library_name: transformers
paper: https://doi.org/10.5281/zenodo.15565394
datasets:
- amandyk/kazakh_wiki_articles
- Eraly-ml/kk-cc-data
direct_use: true
widget:
- text: "KazBERT қазақ тілін [MASK] түсінеді."
---
# KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿
## License & Metadata
- **License:** apache-2.0
- **Languages:** Kazakh (kk), Russian (ru), English (en)
- **Base Model:** google-bert/bert-base-uncased
- **Pipeline Tag:** fill-mask
- **Tags:** pytorch, safetensors
- **Library:** transformers
- **Datasets:**
  - amandyk/kazakh_wiki_articles
  - Eraly-ml/kk-cc-data
- **Direct Use:** ✅
- **Widget Example:**
`"KazBERT қазақ тілін [MASK] түсінеді."`
## Model Overview
**KazBERT** is a BERT model adapted to Kazakh with Masked Language Modeling (MLM). It starts from `bert-base-uncased` and uses a custom WordPiece tokenizer trained on Kazakh text.
### Model Details
- **Architecture:** BERT
- **Tokenizer:** WordPiece trained on Kazakh text (see the training sketch below)
- **Training Data:** Kazakh Wikipedia & Common Crawl
- **Method:** Masked Language Modeling (MLM)
**Erlanulu, Y. G. (2025). KazBERT: A Custom BERT Model for the Kazakh Language. Zenodo.**
📄 [Read the paper](https://doi.org/10.5281/zenodo.15565394)
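The custom tokenizer is the main departure from the English base model. The exact training script is not published in this card, so the following is only a minimal sketch of how a comparable WordPiece tokenizer could be trained with the 🤗 `tokenizers` library; the corpus filename and vocabulary size are illustrative assumptions.

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical plain-text corpus of Kazakh sentences, one per line.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["kk_corpus.txt"],
    vocab_size=30_522,  # assumption: same size as bert-base-uncased
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes a vocab.txt like the one in this repo
```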
---
## Files in Repository
- `config.json` – Model config
- `model.safetensors` – Model weights
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer config
- `special_tokens_map.json` – Special tokens
- `vocab.txt` – Vocabulary
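Any of these files can be fetched individually, without loading the whole model. A small sketch with `huggingface_hub`, useful for inspecting the vocabulary:

```python
from huggingface_hub import hf_hub_download

# Download only the vocabulary file from the model repo.
vocab_path = hf_hub_download(repo_id="Eraly-ml/KazBERT", filename="vocab.txt")
with open(vocab_path, encoding="utf-8") as f:
    print(sum(1 for _ in f), "tokens in vocab.txt")
```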
---
## Training Configuration
- **Epochs:** 20
- **Batch size:** 16
- **Learning rate:** Default
- **Weight decay:** 0.01
- **FP16 Training:** Enabled
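As a hedged reconstruction, the configuration above maps onto the 🤗 `Trainer` API roughly as follows; the output directory and the 15% masking probability (the standard BERT setting) are assumptions, and the learning rate is left at the `TrainingArguments` default of 5e-5.

```python
from transformers import (
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")

# Dynamic masking for MLM; 15% is the standard BERT probability (an assumption).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters as reported above; output_dir is a placeholder.
args = TrainingArguments(
    output_dir="kazbert-mlm",
    num_train_epochs=20,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    fp16=True,
    # learning_rate is left at the default (5e-5)
)
```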
---
## Usage
Install 🤗 Transformers and load the model:
```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"

# Load the custom Kazakh WordPiece tokenizer and the model with its MLM head.
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```
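Continuing from the snippet above, a minimal sketch of masked-token prediction done by hand, without the `pipeline` helper:

```python
import torch

text = "KazBERT қазақ тілін [MASK] түсінеді."
inputs = tokenizer(text, return_tensors="pt")

# Run the model without gradient tracking and read out the logits.
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```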
---
## Example: Masked Token Prediction
```python
from transformers import pipeline
pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe('KazBERT қазақ тілін [MASK] түсінеді.')
```
**Output:**
```json
[
{"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."},
{"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."},
{"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."},
{"score": 0.029, "token_str": "ерте", "sequence": "KazBERT қазақ тілін ерте түсінеді."},
{"score": 0.026, "token_str": "жете", "sequence": "KazBERT қазақ тілін жете түсінеді."}
]
```
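The top candidate, жетік, means roughly "fluently", i.e. "KazBERT understands Kazakh fluently." The fill-mask pipeline returns five candidates by default; pass `top_k` to get more:

```python
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.", top_k=10)
```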
---
## Bias and Limitations
- Trained only on public Kazakh Wikipedia & Common Crawl data
- Might miss informal speech or dialects
- Could underperform on deep context or rare words
- May reflect cultural or social biases in the data
---
## License
This model is released under the Apache License 2.0.
---
## Citation
```bibtex
@misc{eraly_gainulla_2025,
  author    = {Eraly Gainulla},
  title     = {KazBERT (Revision 15240d4)},
  year      = {2025},
  url       = {https://huggingface.co/Eraly-ml/KazBERT},
  doi       = {10.57967/hf/5271},
  publisher = {Hugging Face}
}
```