---
language: km
license: apache-2.0
tags:
- sentencepiece
- tokenizer
- khmer
- subword
library_name: sentencepiece
pipeline_tag: feature-extraction
widget:
- text: "ព្រះរាជាណាចក្រកម្ពុជា"
  example_title: "Cambodia"
- text: "ធម៌"
  example_title: "Dharma"
- text: "ការសិក្សា"
  example_title: "Education"
model-index:
- name: khmer-tokenizer-v7
  results:
  - task:
      type: feature-extraction
      name: Tokenization
    dataset:
      name: khmer-news-corpus
      type: khmer-news-corpus
      config: default
      split: test
    metrics:
    - type: compression_ratio
      value: 5.27
      name: Compression Ratio
    - type: tokens_per_character
      value: 0.1897
      name: Tokens Per Character
    - type: vocabulary_coverage
      value: 90.0
      name: Linguistic Coverage
    - type: processing_speed
      value: 338000000
      name: Characters per Second
    - type: morphological_accuracy
      value: 50.0
      name: Morphological Accuracy
    - type: sanskrit_pali_accuracy
      value: 100.0
      name: Sanskrit/Pali Accuracy
---

# Khmer SentencePiece Tokenizer

A production-ready SentencePiece tokenizer for the Khmer (Cambodian) language with a 16,000-piece vocabulary, optimized for modern NLP pipelines.

## Direct Usage from HuggingFace 🤗

```python
from transformers import AutoTokenizer

# Load directly from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Tokenize text
text = "ព្រះរាជាណាចក្រកម្ពុជា"
encoded = tokenizer(text, return_tensors="pt")

# Get tokens
tokens = tokenizer.tokenize(text)
print(tokens)  # ['▁ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']

# Encode and decode
input_ids = tokenizer.encode(text)
decoded = tokenizer.decode(input_ids)
print(decoded)  # ព្រះរាជាណាចក្រកម្ពុជា
```

## Installation Options

### Option 1: Transformers (Recommended)
```bash
pip install transformers
```

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
```

### Option 2: SentencePiece Direct
```bash
pip install sentencepiece huggingface-hub
```

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download(
    repo_id="khopilot/khmer-tokenizer-v7",
    filename="tokenizer.model"
)
sp = spm.SentencePieceProcessor(model_file=model_path)
```
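
Once loaded, the raw SentencePiece API mirrors what the `transformers` examples above do; a quick check:

```python
pieces = sp.encode("ព្រះរាជាណាចក្រកម្ពុជា", out_type=str)
print(pieces)          # subword pieces, e.g. ['▁ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']

ids = sp.encode("ព្រះរាជាណាចក្រកម្ពុជា")
print(sp.decode(ids))  # round-trips to the original string
```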

## Evaluation Results

### Performance Metrics (Khmer News Corpus)

| Metric | Value | Description |
|--------|-------|-------------|
| **Compression Ratio** | 5.27x | Average characters per token |
| **Tokens/Character** | 0.1897 | Average tokens per character |
| **Vocabulary Coverage** | 90% | Percentage of linguistic phenomena covered |
| **Processing Speed** | 338M chars/sec | Throughput on CPU |
| **Model Size** | 659KB | Disk space required |
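
These figures were measured on the Khmer news corpus; the two ratio metrics are easy to reproduce on your own data. A minimal sketch (`my_texts` is a placeholder for your corpus):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

my_texts = ["ព្រះរាជាណាចក្រកម្ពុជា", "ការសិក្សា"]  # placeholder corpus
n_chars = sum(len(t) for t in my_texts)
n_tokens = sum(len(tokenizer.tokenize(t)) for t in my_texts)

print(f"tokens/char:       {n_tokens / n_chars:.4f}")   # lower is better
print(f"compression ratio: {n_chars / n_tokens:.2f}x")  # characters per token
```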

### Linguistic Evaluation (Multi-Domain Khmer Corpus)

| Category | Accuracy | Test Size |
|----------|----------|-----------|
| **Sanskrit/Pali Terms** | 100% | 50 terms |
| **Morphological Segmentation** | 50% | 100 compounds |
| **Consonant Clusters** | 100% | 30 patterns |
| **Number Handling** | 95% | 50 examples |
| **Mixed Script** | 88% | 40 samples |

### Domain-Specific Performance

| Domain | Token Efficiency (tokens per character; lower is better) | Quality Score |
|--------|-----------------|---------------|
| **News Articles** | 0.2585 TPC | ⭐⭐⭐⭐⭐ |
| **Religious Texts** | 0.2103 TPC | ⭐⭐⭐⭐⭐ |
| **Technical Docs** | 0.2891 TPC | ⭐⭐⭐⭐ |
| **Social Media** | 0.3012 TPC | ⭐⭐⭐⭐ |
| **Literature** | 0.2234 TPC | ⭐⭐⭐⭐ |

## Tokenization Examples

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Example 1: Religious term
tokenizer.tokenize("ធម៌")
# Output: ['▁ធម៌']  # 1 token (perfect)

# Example 2: Compound word
tokenizer.tokenize("ការសិក្សា")
# Output: ['▁ការ', 'សិក្សា']  # 2 tokens (morphologically correct)

# Example 3: Long compound
tokenizer.tokenize("អគ្គលេខាធិការ")
# Output: ['▁អគ្គ', 'លេខាធិការ']  # 2 tokens

# Example 4: Mixed numerals
tokenizer.tokenize("ឆ្នាំ២០២៤")
# Output: ['▁ឆ្នាំ', '២០២', '៤']  # 3 tokens
```

## Advanced Usage

### Batch Processing
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

texts = [
    "ព្រះរាជាណាចក្រកម្ពុជា",
    "ធម៌",
    "ការសិក្សា"
]

# Batch encode
encoded = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

print(encoded["input_ids"].shape)  # torch.Size([3, max_length])
```

### With PyTorch DataLoader
```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

class KhmerDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(),
            "attention_mask": encoding["attention_mask"].squeeze()
        }

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
texts = ["ព្រះរាជាណាចក្រកម្ពុជា", "ធម៌", "ការសិក្សា"]  # example corpus
dataset = KhmerDataset(texts, tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```
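
Iterating the loader then yields padded, fixed-length tensors ready for a training loop:

```python
for batch in dataloader:
    # With padding="max_length" each sequence is padded to 512 positions
    print(batch["input_ids"].shape)       # torch.Size([3, 512]) for the tiny corpus above
    print(batch["attention_mask"].shape)  # same shape; 1 = real token, 0 = padding
```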

### For Language Models
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")

# Add special tokens if needed
tokenizer.add_special_tokens({
    "pad_token": "<pad>",
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>"
})
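
# If any tokens were actually added, resize the model's embeddings to match:
#   model = AutoModelForCausalLM.from_pretrained("your-khmer-lm")  # hypothetical checkpoint
#   model.resize_token_embeddings(len(tokenizer))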

# Use with any model
text = "ព្រះរាជាណាចក្រកម្ពុជា"
inputs = tokenizer(text, return_tensors="pt")
# Ready for model.generate() or model.forward()
```

## Model Configuration

```yaml
Architecture: SentencePiece Unigram
Vocabulary Size: 16,000
Character Coverage: 99.99%
Max Piece Length: 8
Split by Unicode Script: Yes
Byte Fallback: Enabled
Special Tokens: <unk>, <s>, </s>, <pad>, <MASK>, <CLS>, <SEP>
```
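
The exact training invocation isn't published; as an assumption-laden sketch, a comparable model could be trained with `sentencepiece` roughly like this (the corpus path, token IDs, and special-token wiring are placeholders inferred from the configuration above):

```python
import sentencepiece as spm

# Hypothetical reconstruction from the configuration above, not the official recipe
spm.SentencePieceTrainer.train(
    input="khmer_corpus.txt",     # placeholder corpus path
    model_prefix="tokenizer",
    model_type="unigram",
    vocab_size=16000,
    character_coverage=0.9999,
    max_sentencepiece_length=8,
    split_by_unicode_script=True,
    byte_fallback=True,
    pad_id=3,                     # assumption: <unk>=0, <s>=1, </s>=2, <pad>=3
    user_defined_symbols=["<MASK>", "<CLS>", "<SEP>"],
)
```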

## Training Details

- **Training Data:** 2.6M characters of diverse Khmer text
- **Data Sources:** News, religious texts, technical docs, social media, literature
- **Special Weighting:** Sanskrit/Pali terms (3x), morphological patterns (2x)
- **Optimization:** Natural frequency distribution, no artificial repetition

## File Structure

```
khopilot/khmer-tokenizer-v7/
├── tokenizer.model          # SentencePiece model (659KB)
├── tokenizer.vocab          # Vocabulary file
├── tokenizer_config.json    # HuggingFace config
├── special_tokens_map.json  # Special tokens mapping
└── config.json              # Model metadata
```
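
To mirror all of these files locally in one call, `huggingface_hub` provides `snapshot_download`:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="khopilot/khmer-tokenizer-v7")
print(local_dir)  # path containing tokenizer.model, tokenizer.vocab, and the config files
```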

## Citation

```bibtex
@misc{khmer-tokenizer-v7-2024,
  author = {Niko},
  title = {Khmer SentencePiece Tokenizer v7},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
}
```

## License

Apache 2.0

---

**Support:** Open a discussion on [HuggingFace](https://huggingface.co/khopilot/khmer-tokenizer-v7/discussions) | **Model size:** 659KB