---
language: km
license: apache-2.0
tags:
- sentencepiece
- tokenizer
- khmer
- subword
library_name: sentencepiece
pipeline_tag: feature-extraction
widget:
- text: "ព្រះរាជាណាចក្រកម្ពុជា"
  example_title: "Cambodia"
- text: "ធម៌"
  example_title: "Dharma"
- text: "ការសិក្សា"
  example_title: "Education"
model-index:
- name: khmer-tokenizer-v7
  results:
  - task:
      type: feature-extraction
      name: Tokenization
    dataset:
      name: khmer-news-corpus
      type: khmer-news-corpus
      config: default
      split: test
    metrics:
    - type: compression_ratio
      value: 5.27
      name: Compression Ratio
    - type: tokens_per_character
      value: 0.1897
      name: Tokens Per Character
    - type: vocabulary_coverage
      value: 90.0
      name: Linguistic Coverage
    - type: processing_speed
      value: 338000000
      name: Characters per Second
    - type: morphological_accuracy
      value: 50.0
      name: Morphological Accuracy
    - type: sanskrit_pali_accuracy
      value: 100.0
      name: Sanskrit/Pali Accuracy
---

# Khmer SentencePiece Tokenizer

A production-ready SentencePiece tokenizer for the Khmer (Cambodian) language with a 16k vocabulary, optimized for modern NLP pipelines.

## Direct Usage from HuggingFace 🤗

```python
from transformers import AutoTokenizer

# Load directly from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Tokenize text
text = "ព្រះរាជាណាចក្រកម្ពុជា"
encoded = tokenizer(text, return_tensors="pt")

# Get tokens
tokens = tokenizer.tokenize(text)
print(tokens)  # ['▁ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']

# Encode and decode
input_ids = tokenizer.encode(text)
decoded = tokenizer.decode(input_ids)
print(decoded)  # ព្រះរាជាណាចក្រកម្ពុជា
```

## Installation Options

### Option 1: Transformers (Recommended)

```bash
pip install transformers
```

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
```

### Option 2: SentencePiece Direct

```bash
pip install sentencepiece huggingface-hub
```

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download(
    repo_id="khopilot/khmer-tokenizer-v7",
    filename="tokenizer.model"
)
sp = spm.SentencePieceProcessor(model_file=model_path)
```

## Evaluation Results

### Performance Metrics (Khmer News Corpus)

| Metric | Value | Description |
|--------|-------|-------------|
| **Compression Ratio** | 5.27x | Average characters per token |
| **Tokens/Character** | 0.1897 | Average tokens per character |
| **Vocabulary Coverage** | 90% | Percentage of linguistic phenomena covered |
| **Processing Speed** | 338M chars/sec | Throughput on CPU |
| **Model Size** | 659KB | Disk space required |

### Linguistic Evaluation (Multi-Domain Khmer Corpus)

| Category | Accuracy | Test Size |
|----------|----------|-----------|
| **Sanskrit/Pali Terms** | 100% | 50 terms |
| **Morphological Segmentation** | 50% | 100 compounds |
| **Consonant Clusters** | 100% | 30 patterns |
| **Number Handling** | 95% | 50 examples |
| **Mixed Script** | 88% | 40 samples |

### Domain-Specific Performance

| Domain | Token Efficiency (TPC, lower is better) | Quality Score |
|--------|------------------------------------------|---------------|
| **News Articles** | 0.2585 | ⭐⭐⭐⭐⭐ |
| **Religious Texts** | 0.2103 | ⭐⭐⭐⭐⭐ |
| **Technical Docs** | 0.2891 | ⭐⭐⭐⭐ |
| **Social Media** | 0.3012 | ⭐⭐⭐⭐ |
| **Literature** | 0.2234 | ⭐⭐⭐⭐ |

## Tokenization Examples

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Example 1: Religious term
tokenizer.tokenize("ធម៌")
# Output: ['▁ធម៌']  # 1 token (perfect)

# Example 2: Compound word
tokenizer.tokenize("ការសិក្សា")
# Output: ['▁ការ', 'សិក្សា']  # 2 tokens (morphologically correct)

# Example 3: Long compound
tokenizer.tokenize("អគ្គលេខាធិការ")
# Output: ['▁អគ្គ', 'លេខាធិការ']  # 2 tokens

# Example 4: Mixed numerals
tokenizer.tokenize("ឆ្នាំ២០២៤")
# Output: ['▁ឆ្នាំ', '២០២', '៤']  # 3 tokens
```
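The efficiency metrics reported above follow directly from token and character counts, and the two headline figures are reciprocals: a tokens-per-character (TPC) value of 0.1897 implies a compression ratio of 1 / 0.1897 ≈ 5.27. The snippet below is a minimal sketch of how such figures can be computed for any string, assuming only the `AutoTokenizer` load used throughout this card; the helper name `token_stats` is illustrative, not part of the released tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

def token_stats(text: str) -> dict:
    """Per-string tokens-per-character (TPC) and compression ratio."""
    tokens = tokenizer.tokenize(text)
    n_chars, n_tokens = len(text), len(tokens)
    return {
        "tokens": n_tokens,
        "chars": n_chars,
        "tpc": n_tokens / n_chars,          # lower is better
        "compression": n_chars / n_tokens,  # higher is better
    }

print(token_stats("ព្រះរាជាណាចក្រកម្ពុជា"))
```

Corpus-level numbers like those in the tables above would aggregate token and character counts across many documents before dividing, rather than averaging per-string ratios.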
## Advanced Usage

### Batch Processing

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

texts = [
    "ព្រះរាជាណាចក្រកម្ពុជា",
    "ធម៌",
    "ការសិក្សា"
]

# Batch encode
encoded = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# padding=True pads to the longest sequence in the batch (capped at max_length)
print(encoded["input_ids"].shape)  # torch.Size([3, longest_sequence_in_batch])
```

### With PyTorch DataLoader

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

class KhmerDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(),
            "attention_mask": encoding["attention_mask"].squeeze()
        }

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
texts = ["ព្រះរាជាណាចក្រកម្ពុជា", "ធម៌", "ការសិក្សា"]  # any list of Khmer strings
dataset = KhmerDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```

### For Language Models

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Add special tokens if needed
tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>"
})

# Use with any model
text = "ព្រះរាជាណាចក្រកម្ពុជា"
inputs = tokenizer(text, return_tensors="pt")
# Ready for model.generate() or model.forward()
```

## Model Configuration

```yaml
Architecture: SentencePiece Unigram
Vocabulary Size: 16,000
Character Coverage: 99.99%
Max Piece Length: 8
Split by Unicode Script: Yes
Byte Fallback: Enabled
Special Tokens: <unk>, <s>, </s>, <pad>, …
```

## Training Details

- **Training Data:** 2.6M characters of diverse Khmer text
- **Data Sources:** News, religious texts, technical docs, social media, literature
- **Special Weighting:** Sanskrit/Pali terms (3x), morphological patterns (2x)
- **Optimization:** Natural frequency distribution, no artificial repetition

## File Structure

```
khopilot/khmer-tokenizer-v7/
├── tokenizer.model           # SentencePiece model (659KB)
├── tokenizer.vocab           # Vocabulary file
├── tokenizer_config.json     # HuggingFace config
├── special_tokens_map.json   # Special tokens mapping
└── config.json               # Model metadata
```

## Citation

```bibtex
@misc{khmer-tokenizer-v7-2024,
  author = {Niko},
  title = {Khmer SentencePiece Tokenizer v7},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
}
```

## License

Apache 2.0

---

**Support:** Open an issue on [HuggingFace](https://huggingface.co/khopilot/khmer-tokenizer-v7/discussions) | **Model size:** 659KB
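A final sanity check: the Model Configuration section lists byte fallback as enabled, which means characters outside the 16k vocabulary should decompose into byte pieces rather than collapse to an unknown token, so encode/decode round trips are expected to be lossless for most input. The snippet below is a minimal sketch of that check under this assumption; the sample strings are arbitrary, and SentencePiece normalization can still alter unusual whitespace or compatibility characters.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Mixed Khmer, Latin, digits, and an emoji to exercise byte fallback.
samples = [
    "ព្រះរាជាណាចក្រកម្ពុជា",
    "ឆ្នាំ២០២៤ (2024)",
    "Khmer NLP 🇰🇭",
]

for text in samples:
    ids = tokenizer.encode(text, add_special_tokens=False)
    roundtrip = tokenizer.decode(ids)
    # Expected True when byte fallback covers out-of-vocabulary characters;
    # normalization edge cases may differ.
    print(roundtrip == text, repr(roundtrip))
```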