Tajik Language Tokenizers (v1.1)

State-of-the-art tokenizers for the Tajik language, trained on a diverse corpus including literature, news, and academic texts.

🔍 Tokenization Showcase

Example: Famous Tajik Mathematician

Input text:
"Эргашбой Мирзоевич Муҳамадиев - риёзидони бузурги тоҷик."
(English: "Ergashboy Mirzoevich Muhamadiev is a great Tajik mathematician.")

Tokenization results:

| Model | Tokens |
| --- | --- |
| HF BPE | `['Эргаш', 'бой', 'Мирзо', 'евич', 'Му', 'ҳама', 'диев', '-', 'риёзи', 'дони', 'бузурги', 'тоҷик']` |
| HF WordPiece | `['Эргаш', '##бой', 'Мирзоев', '##ич', 'Муҳам', '##ади', '##ев', '-', 'риёзидон', '##и', 'бузурги']` |
| SP BPE | `['▁Эргаш', 'бой', '▁Мирзо', 'евич', '▁Муҳам', 'ади', 'ев', '▁-', '▁риёзид', 'они', '▁бузурги', '▁тоҷик']` |
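The two subword conventions above can be inverted with plain Python: SentencePiece marks the start of each word with `▁`, while WordPiece marks word continuations with `##`. A minimal detokenization sketch using the token lists from the showcase (note that the pieces shown omit the sentence-final period):

```python
# SentencePiece: '▁' marks the start of a word; join, then map '▁' -> space.
sp_pieces = ['▁Эргаш', 'бой', '▁Мирзо', 'евич', '▁Муҳам', 'ади', 'ев',
             '▁-', '▁риёзид', 'они', '▁бузурги', '▁тоҷик']
sp_text = "".join(sp_pieces).replace("▁", " ").strip()

# WordPiece: '##' marks a continuation; glue those onto the previous token.
wp_pieces = ['Эргаш', '##бой', 'Мирзоев', '##ич', 'Муҳам', '##ади', '##ев',
             '-', 'риёзидон', '##и', 'бузурги']
wp_text = " ".join(wp_pieces).replace(" ##", "")

print(sp_text)  # Эргашбой Мирзоевич Муҳамадиев - риёзидони бузурги тоҷик
print(wp_text)  # Эргашбой Мирзоевич Муҳамадиев - риёзидони бузурги
```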

Python Usage

from tokenizers import Tokenizer
import sentencepiece as spm

# Load the Hugging Face and SentencePiece BPE tokenizers
hf_bpe = Tokenizer.from_file("tajik_bpe.json")
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load("tajik_sp_bpe.model")

# Tokenize sample
text = "Эргашбой Мирзоевич Муҳамадиев - риёзидони бузурги тоҷик."
print("HF BPE:", hf_bpe.encode(text).tokens)
print("SP BPE:", sp_bpe.encode_as_pieces(text))
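The same `tokenizers` library can also retrain a comparable model on your own corpus. A minimal sketch, assuming a tiny placeholder corpus and an illustrative vocabulary size (both hypothetical, not the settings used for the released models):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build an untrained BPE tokenizer with whitespace pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Placeholder corpus; in practice, stream your own Tajik text here.
corpus = [
    "Эргашбой Мирзоевич Муҳамадиев - риёзидони бузурги тоҷик.",
    "Забони тоҷикӣ забони давлатии Тоҷикистон аст.",
]
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

tokens = tokenizer.encode("риёзидони тоҷик").tokens
print(tokens)
```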

📊 Performance Benchmarks

| Model | Avg Subwords | OOV Rate | Speed (words/sec) |
| --- | --- | --- | --- |
| HF BPE | 1.20 | 0% | 124,961 |
| HF WordPiece | 1.22 | 0% | 147,557 |
| SP BPE | 1.34 | 0% | 106,239 |
| SP Unigram | 1.39 | 0% | 175,230 |
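The "Avg Subwords" column is a fertility measure: total subword tokens divided by total whitespace-separated words (lower means less fragmentation). A minimal sketch of how such a number can be computed for any tokenizer; the toy tokenizer here is a hypothetical stand-in used only so the example is self-contained:

```python
def avg_subwords_per_word(texts, tokenize):
    """Average number of subword tokens produced per whitespace-separated word."""
    total_words = sum(len(t.split()) for t in texts)
    total_subwords = sum(len(tokenize(t)) for t in texts)
    return total_subwords / total_words

# Hypothetical toy tokenizer: splits every word into two halves.
def toy_tokenize(text):
    return [piece
            for word in text.split()
            for piece in (word[: len(word) // 2], word[len(word) // 2:])
            if piece]

print(avg_subwords_per_word(["бузурги тоҷик"], toy_tokenize))  # 2.0
```

With a real tokenizer, pass e.g. `lambda t: hf_bpe.encode(t).tokens` as the `tokenize` argument.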

📦 Included Models

  • Hugging Face:
    • BPE, WordPiece, Unigram tokenizers
  • SentencePiece:
    • BPE and Unigram models
  • Full vocabulary files

🤝 Citation

@misc{TajikTokenizers2025,
  author = {Arabov, M.K.},
  title = {Tajik Language Tokenizers},
  year = 2025,
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/ArabovMK/tajik-tokenizers-v1}
}

Last updated: 2025-05-10
