# Khmer SentencePiece Tokenizer

A production-ready SentencePiece tokenizer for the Khmer (Cambodian) language with a 16k vocabulary, optimized for modern NLP pipelines.
## Direct Usage from HuggingFace 🤗

```python
from transformers import AutoTokenizer

# Load directly from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Tokenize text
text = "ព្រះរាជាណាចក្រកម្ពុជា"
encoded = tokenizer(text, return_tensors="pt")

# Get tokens
tokens = tokenizer.tokenize(text)
print(tokens)  # ['▁ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']

# Encode and decode
input_ids = tokenizer.encode(text)
decoded = tokenizer.decode(input_ids)
print(decoded)  # ព្រះរាជាណាចក្រកម្ពុជា
```
## Installation Options

### Option 1: Transformers (Recommended)

```bash
pip install transformers
```

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
```
### Option 2: SentencePiece Direct

```bash
pip install sentencepiece huggingface-hub
```

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Download the raw SentencePiece model from the Hub
model_path = hf_hub_download(
    repo_id="khopilot/khmer-tokenizer-v7",
    filename="tokenizer.model"
)
sp = spm.SentencePieceProcessor(model_file=model_path)
```
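With the raw processor you can encode to subword pieces or integer ids and decode back, using the standard `sentencepiece` API; a minimal sketch:

```python
text = "ព្រះរាជាណាចក្រកម្ពុជា"

# Encode to subword pieces or to integer ids
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)

# Decode ids back to the original string
print(sp.decode(ids))  # ព្រះរាជាណាចក្រកម្ពុជា
```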
## Evaluation Results

### Performance Metrics (Khmer News Corpus)

| Metric | Value | Description |
|---|---|---|
| Compression Ratio | 5.27x | Characters compressed per token |
| Tokens/Character (TPC) | 0.1897 | Average tokens per character |
| Vocabulary Coverage | 90% | Percentage of linguistic phenomena covered |
| Processing Speed | 338M chars/sec | Throughput on CPU |
| Model Size | 659 KB | Disk space required |
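Compression ratio and TPC are straightforward to reproduce: TPC is tokens divided by characters, and compression is its inverse. A minimal sketch (the corpus string here is an illustrative stand-in, not the evaluation corpus behind the reported numbers):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

corpus = "ព្រះរាជាណាចក្រកម្ពុជា"  # stand-in text; reported figures use a news corpus

n_chars = len(corpus)
n_tokens = len(tokenizer.tokenize(corpus))

tpc = n_tokens / n_chars           # tokens per character (lower is better)
compression = n_chars / n_tokens   # characters per token (higher is better)
print(f"TPC={tpc:.4f}, compression={compression:.2f}x")
```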
### Linguistic Evaluation (Multi-Domain Khmer Corpus)

| Category | Accuracy | Test Size |
|---|---|---|
| Sanskrit/Pali Terms | 100% | 50 terms |
| Morphological Segmentation | 50% | 100 compounds |
| Consonant Clusters | 100% | 30 patterns |
| Number Handling | 95% | 50 examples |
| Mixed Script | 88% | 40 samples |
### Domain-Specific Performance

| Domain | Token Efficiency (TPC) | Quality Score |
|---|---|---|
| News Articles | 0.2585 | ⭐⭐⭐⭐⭐ |
| Religious Texts | 0.2103 | ⭐⭐⭐⭐⭐ |
| Technical Docs | 0.2891 | ⭐⭐⭐⭐ |
| Social Media | 0.3012 | ⭐⭐⭐⭐ |
| Literature | 0.2234 | ⭐⭐⭐⭐ |
## Tokenization Examples

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Example 1: Religious term
tokenizer.tokenize("ធម៌")
# Output: ['▁ធម៌']  # 1 token (perfect)

# Example 2: Compound word
tokenizer.tokenize("ការសិក្សា")
# Output: ['▁ការ', 'សិក្សា']  # 2 tokens (morphologically correct)

# Example 3: Long compound
tokenizer.tokenize("អគ្គលេខាធិការ")
# Output: ['▁អគ្គ', 'លេខាធិការ']  # 2 tokens

# Example 4: Mixed numerals
tokenizer.tokenize("ឆ្នាំ២០២៤")
# Output: ['▁ឆ្នាំ', '២០២', '៤']  # 3 tokens
```
## Advanced Usage

### Batch Processing

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

texts = [
    "ព្រះរាជាណាចក្រកម្ពុជា",
    "ធម៌",
    "ការសិក្សា"
]

# Batch encode; padding=True pads to the longest sequence in the batch,
# and truncation caps sequences at max_length
encoded = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
print(encoded["input_ids"].shape)  # torch.Size([3, longest_sequence_in_batch])
```
### With PyTorch DataLoader

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

class KhmerDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Pad every example to max_length so batches stack cleanly
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(),
            "attention_mask": encoding["attention_mask"].squeeze()
        }

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
texts = ["ព្រះរាជាណាចក្រកម្ពុជា", "ធម៌", "ការសិក្សា"]  # replace with your corpus
dataset = KhmerDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```
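Iterating the loader then yields stacked tensors ready for a model's forward pass; a minimal sketch:

```python
for batch in dataloader:
    input_ids = batch["input_ids"]            # shape: (batch_size, max_length)
    attention_mask = batch["attention_mask"]  # 1 for real tokens, 0 for padding
    # feed to model(input_ids=input_ids, attention_mask=attention_mask)
    break
```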
### For Language Models

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Add special tokens if needed (a no-op for tokens already in the vocabulary)
tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>"
})

# Use with any model
text = "ព្រះរាជាណាចក្រកម្ពុជា"
inputs = tokenizer(text, return_tensors="pt")
# Ready for model.generate() or model.forward()
```
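If `add_special_tokens` did grow the vocabulary, the model's embedding matrix must be resized to match; a sketch, where `"your-base-model"` is a hypothetical checkpoint placeholder:

```python
model = AutoModelForCausalLM.from_pretrained("your-base-model")  # hypothetical checkpoint

# Keep the embedding table in sync with the (possibly extended) tokenizer
model.resize_token_embeddings(len(tokenizer))

outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```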
## Model Configuration

- Architecture: SentencePiece Unigram
- Vocabulary Size: 16,000
- Character Coverage: 99.99%
- Max Piece Length: 8
- Split by Unicode Script: Yes
- Byte Fallback: Enabled
- Special Tokens: `<unk>`, `<s>`, `</s>`, `<pad>`, `<MASK>`, `<CLS>`, `<SEP>`
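These settings map directly onto `SentencePieceTrainer` flags; a hedged sketch of an equivalent training invocation (the corpus path, `pad_id` assignment, and the placement of `<MASK>`/`<CLS>`/`<SEP>` as user-defined symbols are assumptions, not the author's published recipe):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="khmer_corpus.txt",    # assumed local corpus file
    model_prefix="tokenizer",
    model_type="unigram",
    vocab_size=16000,
    character_coverage=0.9999,
    max_sentencepiece_length=8,
    split_by_unicode_script=True,
    byte_fallback=True,
    pad_id=3,                    # adds <pad> alongside the default <unk>, <s>, </s>
    user_defined_symbols=["<MASK>", "<CLS>", "<SEP>"],
)
```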
## Training Details

- Training Data: 2.6M characters of diverse Khmer text
- Data Sources: News, religious texts, technical docs, social media, literature
- Special Weighting: Sanskrit/Pali terms (3x), morphological patterns (2x)
- Optimization: Natural frequency distribution, no artificial repetition
## File Structure

```
khopilot/khmer-tokenizer-v7/
├── tokenizer.model           # SentencePiece model (659 KB)
├── tokenizer.vocab           # Vocabulary file
├── tokenizer_config.json     # HuggingFace config
├── special_tokens_map.json   # Special tokens mapping
└── config.json               # Model metadata
```
## Citation

```bibtex
@misc{khmer-tokenizer-v7-2024,
  author = {Niko},
  title = {Khmer SentencePiece Tokenizer v7},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
}
```
## License

Apache 2.0

Support: open an issue on the HuggingFace repository | Model size: 659 KB