# Tajik Language Tokenizers (v1.1)

State-of-the-art tokenizers for the Tajik language, trained on a diverse corpus of literature, news, and academic texts.
## 🔍 Tokenization Showcase

**Example: Famous Tajik Mathematician**

Input text:

> "Эргашбой Мирзоевич Муҳамадиев - риёзидони бузурги тоҷик."
> ("Ergashboy Mirzoevich Muhamadiev is a great Tajik mathematician.")
Tokenization results:

| Model | Tokens |
|---|---|
| HF BPE | `['Эргаш', 'бой', 'Мирзо', 'евич', 'Му', 'ҳама', 'диев', '-', 'риёзи', 'дони', 'бузурги', 'тоҷик']` |
| HF WordPiece | `['Эргаш', '##бой', 'Мирзоев', '##ич', 'Муҳам', '##ади', '##ев', '-', 'риёзидон', '##и', 'бузурги']` |
| SP BPE | `['▁Эргаш', 'бой', '▁Мирзо', 'евич', '▁Муҳам', 'ади', 'ев', '▁-', '▁риёзид', 'они', '▁бузурги', '▁тоҷик']` |
## Python Usage

```python
from tokenizers import Tokenizer
import sentencepiece as spm

# Load the Hugging Face BPE tokenizer
hf_bpe = Tokenizer.from_file("tajik_bpe.json")

# Load the SentencePiece BPE model
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load("tajik_sp_bpe.model")

# Tokenize a sample sentence
text = "Эргашбой Мирзоевич Муҳамадиев - риёзидони бузурги тоҷик."
print("HF BPE:", hf_bpe.encode(text).tokens)
print("SP BPE:", sp_bpe.encode_as_pieces(text))
```
## 📊 Performance Benchmarks

| Model | Avg Subwords / Word | OOV Rate | Speed (words/sec) |
|---|---|---|---|
| HF BPE | 1.2 | 0% | 124,961 |
| HF WordPiece | 1.22 | 0% | 147,557 |
| SP BPE | 1.34 | 0% | 106,239 |
| SP Unigram | 1.39 | 0% | 175,230 |
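The average-subwords column is tokenizer fertility: the mean number of subword tokens produced per word (lower means less fragmentation). A minimal sketch of how that figure can be computed, using a toy BPE tokenizer trained in memory; in practice the released models would be loaded from their files instead:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a tiny throwaway BPE tokenizer as a stand-in for tajik_bpe.json,
# which would normally be loaded with Tokenizer.from_file().
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tok.train_from_iterator(
    ["риёзидони бузурги тоҷик", "Эргашбой Мирзоевич Муҳамадиев"], trainer
)

def avg_subwords_per_word(tokenizer, words):
    """Fertility: mean number of subword tokens produced per word."""
    return sum(len(tokenizer.encode(w).tokens) for w in words) / len(words)

words = "риёзидони бузурги тоҷик".split()
print(avg_subwords_per_word(tok, words))  # >= 1.0; lower is better
```

On a real evaluation set, the word list would come from a held-out corpus rather than the training data.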
## 📦 Included Models

- Hugging Face:
  - BPE, WordPiece, and Unigram tokenizers
- SentencePiece:
  - BPE and Unigram models
- Full vocabulary files
## 🤝 Citation

```bibtex
@misc{TajikTokenizers2025,
  author    = {Arabov, M.K.},
  title     = {Tajik Language Tokenizers},
  year      = {2025},
  publisher = {Hugging Face Hub},
  url       = {https://huggingface.co/ArabovMK/tajik-tokenizers-v1}
}
```
Last updated: 2025-05-10