---
license: apache-2.0
language:
- kk
- ru
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: fill-mask
tags:
- pytorch
- safetensors
library_name: transformers
paper: https://doi.org/10.5281/zenodo.15565394
datasets:
- amandyk/kazakh_wiki_articles
- Eraly-ml/kk-cc-data
direct_use: true
widget:
- text: "KazBERT қазақ тілін [MASK] түсінеді."
---

# KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿
## License & Metadata

- **License:** apache-2.0
- **Languages:** Kazakh (kk), Russian (ru), English (en)
- **Base Model:** google-bert/bert-base-uncased
- **Pipeline Tag:** fill-mask
- **Tags:** pytorch, safetensors
- **Library:** transformers
- **Datasets:**
  - amandyk/kazakh_wiki_articles
  - Eraly-ml/kk-cc-data
- **Direct Use:** ✅
- **Widget Example:** `"KazBERT қазақ тілін [MASK] түсінеді."`
## Model Overview

**KazBERT** is a BERT-based model fine-tuned specifically for Kazakh using masked language modeling (MLM). It is based on `bert-base-uncased` and uses a custom WordPiece tokenizer trained on Kazakh text.

### Model Details

- **Architecture:** BERT
- **Tokenizer:** WordPiece, trained on Kazakh text
- **Training Data:** Kazakh Wikipedia & Common Crawl
- **Method:** Masked Language Modeling (MLM)

**Erlanulu, Y. G. (2025). KazBERT: A Custom BERT Model for the Kazakh Language. Zenodo.**
📄 [Read the paper](https://doi.org/10.5281/zenodo.15565394)

---

## Files in Repository

- `config.json` – Model configuration
- `model.safetensors` – Model weights
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer configuration
- `special_tokens_map.json` – Special tokens
- `vocab.txt` – Vocabulary

---

## Training Configuration

- **Epochs:** 20
- **Batch size:** 16
- **Learning rate:** Default
- **Weight decay:** 0.01
- **FP16 training:** Enabled

A hedged fine-tuning sketch using these settings appears at the end of this card.

---

## Usage

Install 🤗 Transformers and load the model:

```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```

---

## Example: Masked Token Prediction

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.")
```

The input sentence means "KazBERT understands the Kazakh language [MASK]."

**Output:**

```json
[
  {"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."},
  {"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."},
  {"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."},
  {"score": 0.029, "token_str": "ерте", "sequence": "KazBERT қазақ тілін ерте түсінеді."},
  {"score": 0.026, "token_str": "жете", "sequence": "KazBERT қазақ тілін жете түсінеді."}
]
```

---

## Bias and Limitations

- Trained only on public Kazakh Wikipedia and Common Crawl text
- May miss informal speech and dialects
- May underperform on deep-context or rare words
- May reflect cultural or social biases present in the training data

---

## License

Apache 2.0 License

---

## Citation

```bibtex
@misc{eraly_gainulla_2025,
  author    = {Eraly Gainulla},
  title     = {KazBERT (Revision 15240d4)},
  year      = 2025,
  url       = {https://huggingface.co/Eraly-ml/KazBERT},
  doi       = {10.57967/hf/5271},
  publisher = {Hugging Face}
}
```
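---

## Fine-tuning Sketch

The following is a minimal, hedged sketch of how the MLM fine-tuning in "Training Configuration" above could be reproduced with 🤗 Transformers. Only the epochs, batch size, weight decay, and FP16 settings come from this card; the dataset choice, `text` column name, maximum sequence length, masking probability, and output directory are assumptions for illustration, not the authors' actual script.

```python
# Sketch of the MLM fine-tuning described in "Training Configuration".
# Assumptions (not confirmed by the card): dataset schema, max length,
# masking probability, and output directory. Only num_train_epochs=20,
# per_device_train_batch_size=16, weight_decay=0.01, and fp16=True come
# from the card; the learning rate is left at the Trainer default.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Custom Kazakh WordPiece tokenizer from this repo, base weights from BERT
tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))  # match the custom vocabulary

dataset = load_dataset("amandyk/kazakh_wiki_articles", split="train")

def tokenize(batch):
    # The "text" column name is an assumption about the dataset schema
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Randomly masks tokens for the standard BERT MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kazbert-mlm",        # hypothetical path
    num_train_epochs=20,             # from the card
    per_device_train_batch_size=16,  # from the card
    weight_decay=0.01,               # from the card
    fp16=True,                       # from the card
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

For brevity the sketch uses only the Wikipedia dataset; a full reproduction would presumably also include the Common Crawl portion (`Eraly-ml/kk-cc-data`).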