KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿

License & Metadata
  • License: apache-2.0
  • Languages: Kazakh (kk), Russian (ru), English (en)
  • Base Model: google-bert/bert-base-uncased
  • Pipeline Tag: fill-mask
  • Tags: pytorch, safetensors
  • Library: transformers
  • Datasets:
    • amandyk/kazakh_wiki_articles
    • Eraly-ml/kk-cc-data
  • Widget Example:
    "KazBERT қазақ тілін [MASK] түсінеді."
    (roughly: "KazBERT understands the Kazakh language [MASK].")

Model Overview

KazBERT is a BERT-based model fine-tuned specifically for Kazakh using Masked Language Modeling (MLM). It is based on bert-base-uncased and uses a custom tokenizer trained on Kazakh text.
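The MLM objective can be illustrated with a minimal, framework-free sketch of BERT's standard 80/10/10 masking rule. Everything here (the toy vocabulary, the helper name `mlm_mask`) is illustrative, not KazBERT's actual training code:

```python
import random

MASK = "[MASK]"
VOCAB = ["жетік", "тілін", "қазақ", "де", "терең"]  # toy vocabulary

def mlm_mask(tokens, p=0.15, rng=None):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = rng or random.Random(0)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            labels[i] = tok  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)
            # else: keep the original token unchanged
    return out, labels

tokens = "KazBERT қазақ тілін жетік түсінеді".split()
masked, labels = mlm_mask(tokens, rng=random.Random(7))
```

The loss is computed only at positions where `labels` is set, which is what lets the model learn bidirectional context.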

Model Details

  • Architecture: BERT
  • Tokenizer: WordPiece, trained on Kazakh text
  • Parameters: ~111M (F32)
  • Training Data: Kazakh Wikipedia & Common Crawl
  • Method: Masked Language Modeling (MLM)
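The WordPiece tokenizer mentioned above segments out-of-vocabulary words greedily into the longest matching subwords. A minimal sketch of that algorithm, using a toy vocabulary (not KazBERT's real vocab.txt):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation, as in BERT."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no subword matched: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary for illustration only
vocab = {"тіл", "##ін", "қазақ", "##і"}
print(wordpiece("тілін", vocab))  # ['тіл', '##ін']
```

Training the tokenizer on Kazakh text is what keeps segmentations this compact; the original `bert-base-uncased` vocabulary would shatter Cyrillic Kazakh words into many short pieces.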

Erlanulu, Y. G. (2025). KazBERT: A Custom BERT Model for the Kazakh Language. Zenodo. 📄 Read the paper


Files in Repository

  • config.json – Model config
  • model.safetensors – Model weights
  • tokenizer.json – Tokenizer data
  • tokenizer_config.json – Tokenizer config
  • special_tokens_map.json – Special tokens
  • vocab.txt – Vocabulary

Training Configuration

  • Epochs: 20
  • Batch size: 16
  • Learning rate: Default (5e-5, the 🤗 Trainer default)
  • Weight decay: 0.01
  • FP16 Training: Enabled
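The hyperparameters above map onto 🤗 `TrainingArguments` roughly as follows. This is a sketch, not the authors' actual training script; the output path and any option not listed above are assumptions:

```python
from transformers import TrainingArguments

# learning_rate is left at the Trainer default (5e-5),
# matching "Learning rate: Default" above.
args = TrainingArguments(
    output_dir="kazbert-mlm",        # assumed path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    fp16=True,                       # mixed-precision training
)
```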

Usage

Install 🤗 Transformers and load the model:

# pip install transformers torch
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

Example: Masked Token Prediction

from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe('KazBERT қазақ тілін [MASK] түсінеді.')

Output:

[
  {"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."},
  {"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."},
  {"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."},
  {"score": 0.029, "token_str": "ерте", "sequence": "KazBERT қазақ тілін ерте түсінеді."},
  {"score": 0.026, "token_str": "жете", "sequence": "KazBERT қазақ тілін жете түсінеді."}
]
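The `score` field is a softmax probability over the vocabulary at the `[MASK]` position. The conversion from raw logits to ranked predictions can be sketched in pure Python (the logits and labels below are toy values, not KazBERT's real outputs):

```python
import math

def softmax_topk(logits, labels, k=3):
    """Convert raw logits to probabilities; return the k most likely labels."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Toy logits for illustration only
print(softmax_topk([2.0, 0.5, 1.0], ["жетік", "де", "терең"], k=2))
```

Note that the top score in the example output above is only ~0.20: fill-mask probability mass is spread over the whole vocabulary, so even confident predictions rarely approach 1.0.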

Bias and Limitations

  • Trained only on public Kazakh Wikipedia and Common Crawl text
  • May miss informal speech and regional dialects
  • May underperform on contextually demanding or rare words
  • May reflect cultural or social biases present in the training data

License

Apache 2.0 License


Citation

@misc{eraly_gainulla_2025,
    author    = {Eraly Gainulla},
    title     = {KazBERT (Revision 15240d4)},
    year      = {2025},
    url       = {https://huggingface.co/Eraly-ml/KazBERT},
    doi       = {10.57967/hf/5271},
    publisher = {Hugging Face}
}