# Khmer SentencePiece Tokenizer

A production-ready SentencePiece tokenizer for the Khmer (Cambodian) language with a 16k vocabulary, optimized for modern NLP pipelines.
## Direct Usage from HuggingFace 🤗

```python
from transformers import AutoTokenizer

# Load directly from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Tokenize text
text = "ព្រះរាជាណាចក្រកម្ពុជា"
encoded = tokenizer(text, return_tensors="pt")

# Get tokens
tokens = tokenizer.tokenize(text)
print(tokens)  # ['▁ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']

# Encode and decode
input_ids = tokenizer.encode(text)
decoded = tokenizer.decode(input_ids)
print(decoded)  # ព្រះរាជាណាចក្រកម្ពុជា
```
## Installation Options

### Option 1: Transformers (Recommended)

```bash
pip install transformers
```

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
```
### Option 2: SentencePiece Direct

```bash
pip install sentencepiece huggingface-hub
```

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Download the raw SentencePiece model from the Hub
model_path = hf_hub_download(
    repo_id="khopilot/khmer-tokenizer-v7",
    filename="tokenizer.model"
)
sp = spm.SentencePieceProcessor(model_file=model_path)
```
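With the raw processor you can encode to subword pieces or integer ids and decode back, using the standard `sentencepiece` API; a minimal sketch:

```python
text = "ព្រះរាជាណាចក្រកម្ពុជា"

# Encode to subword pieces or to integer ids
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)

# Decode ids back to the original string
print(sp.decode(ids))  # ព្រះរាជាណាចក្រកម្ពុជា
```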
## Evaluation Results

### Performance Metrics (Khmer News Corpus)

| Metric | Value | Description |
|---|---|---|
| Compression Ratio | 5.27x | Characters compressed per token |
| Tokens/Character (TPC) | 0.1897 | Average tokens per character |
| Vocabulary Coverage | 90% | Percentage of linguistic phenomena covered |
| Processing Speed | 338M chars/sec | Throughput on CPU |
| Model Size | 659 KB | Disk space required |
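Compression ratio and TPC are straightforward to reproduce: TPC is tokens divided by characters, and compression is its inverse. A minimal sketch (the corpus string here is an illustrative stand-in, not the evaluation corpus behind the reported numbers):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

corpus = "ព្រះរាជាណាចក្រកម្ពុជា"  # stand-in text; reported figures use a news corpus

n_chars = len(corpus)
n_tokens = len(tokenizer.tokenize(corpus))

tpc = n_tokens / n_chars           # tokens per character (lower is better)
compression = n_chars / n_tokens   # characters per token (higher is better)
print(f"TPC={tpc:.4f}, compression={compression:.2f}x")
```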
### Linguistic Evaluation (Multi-Domain Khmer Corpus)

| Category | Accuracy | Test Size |
|---|---|---|
| Sanskrit/Pali Terms | 100% | 50 terms |
| Morphological Segmentation | 50% | 100 compounds |
| Consonant Clusters | 100% | 30 patterns |
| Number Handling | 95% | 50 examples |
| Mixed Script | 88% | 40 samples |
### Domain-Specific Performance

| Domain | Token Efficiency (TPC) | Quality Score |
|---|---|---|
| News Articles | 0.2585 | ⭐⭐⭐⭐⭐ |
| Religious Texts | 0.2103 | ⭐⭐⭐⭐⭐ |
| Technical Docs | 0.2891 | ⭐⭐⭐⭐ |
| Social Media | 0.3012 | ⭐⭐⭐⭐ |
| Literature | 0.2234 | ⭐⭐⭐⭐ |
## Tokenization Examples

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Example 1: Religious term
tokenizer.tokenize("ធម៌")
# Output: ['▁ធម៌']  # 1 token (perfect)

# Example 2: Compound word
tokenizer.tokenize("ការសិក្សា")
# Output: ['▁ការ', 'សិក្សា']  # 2 tokens (morphologically correct)

# Example 3: Long compound
tokenizer.tokenize("អគ្គលេខាធិការ")
# Output: ['▁អគ្គ', 'លេខាធិការ']  # 2 tokens

# Example 4: Mixed numerals
tokenizer.tokenize("ឆ្នាំ២០២៤")
# Output: ['▁ឆ្នាំ', '២០២', '៤']  # 3 tokens
```
## Advanced Usage

### Batch Processing

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

texts = [
    "ព្រះរាជាណាចក្រកម្ពុជា",
    "ធម៌",
    "ការសិក្សា"
]

# Batch encode; padding=True pads to the longest sequence in the batch,
# and truncation caps sequences at max_length
encoded = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
print(encoded["input_ids"].shape)  # torch.Size([3, longest_sequence_in_batch])
```
### With PyTorch DataLoader

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

class KhmerDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Pad every example to max_length so batches stack cleanly
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(),
            "attention_mask": encoding["attention_mask"].squeeze()
        }

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
texts = ["ព្រះរាជាណាចក្រកម្ពុជា", "ធម៌", "ការសិក្សា"]  # replace with your corpus
dataset = KhmerDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```
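Iterating the loader then yields stacked tensors ready for a model's forward pass; a minimal sketch:

```python
for batch in dataloader:
    input_ids = batch["input_ids"]            # shape: (batch_size, max_length)
    attention_mask = batch["attention_mask"]  # 1 for real tokens, 0 for padding
    # feed to model(input_ids=input_ids, attention_mask=attention_mask)
    break
```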
### For Language Models

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Add special tokens if needed (a no-op for tokens already in the vocabulary)
tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>"
})

# Use with any model
text = "ព្រះរាជាណាចក្រកម្ពុជា"
inputs = tokenizer(text, return_tensors="pt")
# Ready for model.generate() or model.forward()
```
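If `add_special_tokens` did grow the vocabulary, the model's embedding matrix must be resized to match; a sketch, where `"your-base-model"` is a hypothetical checkpoint placeholder:

```python
model = AutoModelForCausalLM.from_pretrained("your-base-model")  # hypothetical checkpoint

# Keep the embedding table in sync with the (possibly extended) tokenizer
model.resize_token_embeddings(len(tokenizer))

outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```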
## Model Configuration

- Architecture: SentencePiece Unigram
- Vocabulary Size: 16,000
- Character Coverage: 99.99%
- Max Piece Length: 8
- Split by Unicode Script: Yes
- Byte Fallback: Enabled
- Special Tokens: `<unk>`, `<s>`, `</s>`, `<pad>`, `<MASK>`, `<CLS>`, `<SEP>`
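These settings map directly onto `SentencePieceTrainer` flags; a hedged sketch of an equivalent training invocation (the corpus path, `pad_id` assignment, and the placement of `<MASK>`/`<CLS>`/`<SEP>` as user-defined symbols are assumptions, not the author's published recipe):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="khmer_corpus.txt",    # assumed local corpus file
    model_prefix="tokenizer",
    model_type="unigram",
    vocab_size=16000,
    character_coverage=0.9999,
    max_sentencepiece_length=8,
    split_by_unicode_script=True,
    byte_fallback=True,
    pad_id=3,                    # adds <pad> alongside the default <unk>, <s>, </s>
    user_defined_symbols=["<MASK>", "<CLS>", "<SEP>"],
)
```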
## Training Details

- Training Data: 2.6M characters of diverse Khmer text
- Data Sources: News, religious texts, technical docs, social media, literature
- Special Weighting: Sanskrit/Pali terms (3x), morphological patterns (2x)
- Optimization: Natural frequency distribution, no artificial repetition
## File Structure

```
khopilot/khmer-tokenizer-v7/
├── tokenizer.model           # SentencePiece model (659 KB)
├── tokenizer.vocab           # Vocabulary file
├── tokenizer_config.json     # HuggingFace config
├── special_tokens_map.json   # Special tokens mapping
└── config.json               # Model metadata
```
## Citation

```bibtex
@misc{khmer-tokenizer-v7-2024,
  author = {Niko},
  title = {Khmer SentencePiece Tokenizer v7},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
}
```
## License

Apache 2.0

Support: open an issue on the HuggingFace repository | Model size: 659 KB