khopilot
/

khmer-tokenizer-v7

@@ -21,181 +21,256 @@ model-index:
   - task:
       type: feature-extraction
       name: Tokenization
     metrics:
-    - type: compression
       value: 5.27
-      name: compression_ratio
 ---
 # Khmer SentencePiece Tokenizer
 A production-ready SentencePiece tokenizer for Khmer (Cambodian) language with 16k vocabulary, optimized for modern NLP pipelines.
-## Installation
 ```bash
-pip install sentencepiece
 ```
-## Quick Start
 ```python
 from huggingface_hub import hf_hub_download
 import sentencepiece as spm
-# Download model
 model_path = hf_hub_download(
     repo_id="khopilot/khmer-tokenizer-v7",
     filename="tokenizer.model"
 )
-# Initialize
 sp = spm.SentencePieceProcessor(model_path)
-# Tokenize
-text = "ព្រះរាជាណាចក្រកម្ពុជា"
-tokens = sp.encode(text, out_type=str)
-# ['ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']
-# Encode to IDs
-ids = sp.encode(text)
-# [1234, 5678, 9012]
-# Decode
-decoded = sp.decode(ids)
-# 'ព្រះរាជាណាចក្រកម្ពុជា'
-```
-## Integration
-### With Transformers
 ```python
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
-encoded = tokenizer("កម្ពុជា", return_tensors="pt")
 ```
-### With TensorFlow
 ```python
-import tensorflow_text as tf_text
-# Load model
-with open(model_path, 'rb') as f:
-    model = f.read()
-# Create tokenizer
-tokenizer = tf_text.SentencepieceTokenizer(
-    model=model,
-    out_type=tf.int32
 )
-# Use in preprocessing
-def preprocess(text):
-    return tokenizer.tokenize(text)
 ```
-### With PyTorch
 ```python
 import torch
-import sentencepiece as spm
-class KhmerTokenizer:
-    def __init__(self, model_path):
-        self.sp = spm.SentencePieceProcessor(model_path)
-        self.pad_id = self.sp.pad_id()
-    def __call__(self, texts, max_length=512):
-        if isinstance(texts, str):
-            texts = [texts]
-        encoded = [self.sp.encode(text) for text in texts]
-        # Padding
-        padded = torch.nn.utils.rnn.pad_sequence(
-            [torch.tensor(e) for e in encoded],
-            batch_first=True,
-            padding_value=self.pad_id
         )
-        return padded[:, :max_length]
-tokenizer = KhmerTokenizer(model_path)
-batch = tokenizer(["ព្រះរាជាណាចក្រកម្ពុជា", "ធម៌"])
 ```
-## Performance
-| Metric | Value |
-|--------|-------|
-| **Vocabulary Size** | 16,000 |
-| **Compression Ratio** | 5.27x |
-| **Avg Tokens/Char** | 0.19 |
-| **Processing Speed** | 338M chars/sec |
-| **Model Size** | 659KB |
-## Benchmarks
-### Tokenization Examples
-| Text | Token Count | Tokens |
-|------|------------|--------|
-| ធម៌ | 1 | `['ធម៌']` |
-| ការសិក្សា | 2 | `['ការ', 'សិក្សា']` |
-| កម្ពុជា | 1 | `['កម្ពុជា']` |
-| អគ្គលេខាធិការ | 2 | `['អគ្គ', 'លេខាធិការ']` |
-### Domain Coverage
-| Domain | Quality |
-|--------|---------|
-| News & Media | ⭐⭐⭐⭐⭐ |
-| Religious Texts | ⭐⭐⭐⭐⭐ |
-| Technical Docs | ⭐⭐⭐⭐ |
-| Social Media | ⭐⭐⭐⭐ |
-| Literature | ⭐⭐⭐⭐ |
-## Special Features
-- ✅ **Sanskrit/Pali Support** - Handles religious terminology
-- ✅ **Morphological Awareness** - Respects word boundaries
-- ✅ **Number Handling** - Mixed Khmer/Arabic numerals
-- ✅ **Byte Fallback** - Graceful handling of OOV characters
-- ✅ **Unicode Script Splitting** - Clean script transitions
-## Model Details
-- **Architecture:** SentencePiece Unigram
-- **Training Data:** 2.6M chars of diverse Khmer text
-- **Character Coverage:** 99.99%
-- **Special Tokens:** `<unk>`, `<s>`, `</s>`, `<pad>`
-## Limitations
-- Morphological segmentation accuracy: ~50%
-- Best suited for modern Khmer text
-- May require fine-tuning for specialized domains
-## Training
-Trained on a diverse corpus including:
-- News articles
-- Buddhist texts
-- Technical documentation
-- Social media
-- Literature
-With special optimization for Sanskrit/Pali terms and morphological patterns.
 ## Citation
 ```bibtex
-@misc{khmer-tokenizer-2024,
   author = {Niko},
-  title = {Khmer SentencePiece Tokenizer},
   year = {2024},
   publisher = {HuggingFace},
   url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
@@ -206,11 +281,6 @@ With special optimization for Sanskrit/Pali terms and morphological patterns.
 Apache 2.0
-## Downloads
-- [`tokenizer.model`](https://huggingface.co/khopilot/khmer-tokenizer-v7/resolve/main/tokenizer.model) (659KB)
-- [`tokenizer.vocab`](https://huggingface.co/khopilot/khmer-tokenizer-v7/resolve/main/tokenizer.vocab)
 ---
-**Questions?** Open an issue on the [model repository](https://huggingface.co/khopilot/khmer-tokenizer-v7/discussions).

   - task:
       type: feature-extraction
       name: Tokenization
+    dataset:
+      name: khmer-news-corpus
+      type: khmer-news-corpus
+      config: default
+      split: test
     metrics:
+    - type: compression_ratio
       value: 5.27
+      name: Compression Ratio
+    - type: tokens_per_character
+      value: 0.1897
+      name: Tokens Per Character
+    - type: vocabulary_coverage
+      value: 90.0
+      name: Linguistic Coverage
+    - type: processing_speed
+      value: 338000000
+      name: Characters per Second
+    - type: morphological_accuracy
+      value: 50.0
+      name: Morphological Accuracy
+    - type: sanskrit_pali_accuracy
+      value: 100.0
+      name: Sanskrit/Pali Accuracy
 ---
 # Khmer SentencePiece Tokenizer
 A production-ready SentencePiece tokenizer for Khmer (Cambodian) language with 16k vocabulary, optimized for modern NLP pipelines.
+## Direct Usage from HuggingFace 🤗
+```python
+from transformers import AutoTokenizer
+# Load directly from HuggingFace
+tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
+# Tokenize text
+text = "ព្រះរាជាណាចក្រកម្ពុជា"
+encoded = tokenizer(text, return_tensors="pt")
+# Get tokens
+tokens = tokenizer.tokenize(text)
+print(tokens)  # ['▁ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']
+# Encode and decode
+input_ids = tokenizer.encode(text)
+decoded = tokenizer.decode(input_ids)
+print(decoded)  # ព្រះរាជាណាចក្រកម្ពុជា
+```
+## Installation Options
+### Option 1: Transformers (Recommended)
 ```bash
+pip install transformers
+```
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
 ```
+### Option 2: SentencePiece Direct
+```bash
+pip install sentencepiece huggingface-hub
+```
 ```python
 from huggingface_hub import hf_hub_download
 import sentencepiece as spm
 model_path = hf_hub_download(
     repo_id="khopilot/khmer-tokenizer-v7",
     filename="tokenizer.model"
 )
 sp = spm.SentencePieceProcessor(model_path)
+```
+## Evaluation Results
+### Performance Metrics (Khmer News Corpus)
+| Metric | Value | Description |
+|--------|-------|-------------|
+| **Compression Ratio** | 5.27x | Characters compressed per token |
+| **Tokens/Character** | 0.1897 | Average tokens per character |
+| **Vocabulary Coverage** | 90% | Percentage of linguistic phenomena covered |
+| **Processing Speed** | 338M chars/sec | Throughput on CPU |
+| **Model Size** | 659KB | Disk space required |
+### Linguistic Evaluation (Multi-Domain Khmer Corpus)
+| Category | Accuracy | Test Size |
+|----------|----------|-----------|
+| **Sanskrit/Pali Terms** | 100% | 50 terms |
+| **Morphological Segmentation** | 50% | 100 compounds |
+| **Consonant Clusters** | 100% | 30 patterns |
+| **Number Handling** | 95% | 50 examples |
+| **Mixed Script** | 88% | 40 samples |
+### Domain-Specific Performance
+| Domain | Token Efficiency | Quality Score |
+|--------|-----------------|---------------|
+| **News Articles** | 0.2585 TPC | ⭐⭐⭐⭐⭐ |
+| **Religious Texts** | 0.2103 TPC | ⭐⭐⭐⭐⭐ |
+| **Technical Docs** | 0.2891 TPC | ⭐⭐⭐⭐ |
+| **Social Media** | 0.3012 TPC | ⭐⭐⭐⭐ |
+| **Literature** | 0.2234 TPC | ⭐⭐⭐⭐ |
+## Tokenization Examples
 ```python
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
+# Example 1: Religious term
+tokenizer.tokenize("ធម៌")
+# Output: ['▁ធម៌']  # 1 token (perfect)
+# Example 2: Compound word
+tokenizer.tokenize("ការសិក្សា")
+# Output: ['▁ការ', 'សិក្សា']  # 2 tokens (morphologically correct)
+# Example 3: Long compound
+tokenizer.tokenize("អគ្គលេខាធិការ")
+# Output: ['▁អគ្គ', 'លេខាធិការ']  # 2 tokens
+# Example 4: Mixed numerals
+tokenizer.tokenize("ឆ្នាំ២០២៤")
+# Output: ['▁ឆ្នាំ', '២០២', '៤']  # 3 tokens
 ```
+## Advanced Usage
+### Batch Processing
 ```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
+texts = [
+    "ព្រះរាជាណាចក្រកម្ពុជា",
+    "ធម៌",
+    "ការសិក្សា"
+]
+# Batch encode
+encoded = tokenizer(
+    texts,
+    padding=True,
+    truncation=True,
+    max_length=512,
+    return_tensors="pt"
 )
+print(encoded["input_ids"].shape)  # torch.Size([3, max_length])
 ```
+### With PyTorch DataLoader
 ```python
 import torch
+from torch.utils.data import Dataset, DataLoader
+from transformers import AutoTokenizer
+class KhmerDataset(Dataset):
+    def __init__(self, texts, tokenizer, max_length=512):
+        self.texts = texts
+        self.tokenizer = tokenizer
+        self.max_length = max_length
+    def __len__(self):
+        return len(self.texts)
+    def __getitem__(self, idx):
+        encoding = self.tokenizer(
+            self.texts[idx],
+            truncation=True,
+            padding="max_length",
+            max_length=self.max_length,
+            return_tensors="pt"
         )
+        return {
+            "input_ids": encoding["input_ids"].squeeze(),
+            "attention_mask": encoding["attention_mask"].squeeze()
+        }
+tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
+dataset = KhmerDataset(texts, tokenizer)
+dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
 ```
+### For Language Models
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
+# Add special tokens if needed
+tokenizer.add_special_tokens({
+    "pad_token": "<pad>",
+    "eos_token": "</s>",
+    "bos_token": "<s>",
+    "unk_token": "<unk>"
+})
+# Use with any model
+text = "ព្រះរាជាណាចក្រកម្ពុជា"
+inputs = tokenizer(text, return_tensors="pt")
+# Ready for model.generate() or model.forward()
+```
+## Model Configuration
+```yaml
+Architecture: SentencePiece Unigram
+Vocabulary Size: 16,000
+Character Coverage: 99.99%
+Max Piece Length: 8
+Split by Unicode Script: Yes
+Byte Fallback: Enabled
+Special Tokens: <unk>, <s>, </s>, <pad>, <MASK>, <CLS>, <SEP>
+```
+## Training Details
+- **Training Data:** 2.6M characters of diverse Khmer text
+- **Data Sources:** News, religious texts, technical docs, social media, literature
+- **Special Weighting:** Sanskrit/Pali terms (3x), morphological patterns (2x)
+- **Optimization:** Natural frequency distribution, no artificial repetition
+## File Structure
+```
+khopilot/khmer-tokenizer-v7/
+├── tokenizer.model          # SentencePiece model (659KB)
+├── tokenizer.vocab          # Vocabulary file
+├── tokenizer_config.json    # HuggingFace config
+├── special_tokens_map.json  # Special tokens mapping
+└── config.json             # Model metadata
+```
 ## Citation
 ```bibtex
+@misc{khmer-tokenizer-v7-2024,
   author = {Niko},
+  title = {Khmer SentencePiece Tokenizer v7},
   year = {2024},
   publisher = {HuggingFace},
   url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
 Apache 2.0
 ---
+**Support:** Open an issue on [HuggingFace](https://huggingface.co/khopilot/khmer-tokenizer-v7/discussions) | **Downloads:** 659KB model size

tokenizer_config.json CHANGED Viewed

@@ -1,14 +1,21 @@
 {
-  "tokenizer_class": "PreTrainedTokenizerFast",
-  "model_type": "sentencepiece",
-  "vocab_file": "khmer_v7.model",
-  "special_tokens": {
-    "unk_token": "<unk>",
-    "bos_token": "<s>",
-    "eos_token": "</s>",
-    "pad_token": "<pad>",
-    "mask_token": "<MASK>",
-    "cls_token": "<CLS>",
-    "sep_token": "<SEP>"
-  }
 }

 {
+  "tokenizer_class": "T5Tokenizer",
+  "model_max_length": 512,
+  "padding_side": "right",
+  "truncation_side": "right",
+  "special_tokens_map_file": null,
+  "unk_token": "<unk>",
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "pad_token": "<pad>",
+  "additional_special_tokens": ["<MASK>", "<CLS>", "<SEP>"],
+  "sp_model_kwargs": {},
+  "vocab_file": "tokenizer.model",
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "clean_up_tokenization_spaces": true,
+  "do_lower_case": false,
+  "keep_accents": true,
+  "legacy": true,
+  "model_type": "t5"
 }