Upload folder using huggingface_hub
Browse files- README.md +189 -119
- tokenizer_config.json +19 -12
README.md
CHANGED
@@ -21,181 +21,256 @@ model-index:
|
|
21 |
- task:
|
22 |
type: feature-extraction
|
23 |
name: Tokenization
|
|
|
|
|
|
|
|
|
|
|
24 |
metrics:
|
25 |
-
- type:
|
26 |
value: 5.27
|
27 |
-
name:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
28 |
---
|
29 |
|
30 |
# Khmer SentencePiece Tokenizer
|
31 |
|
32 |
A production-ready SentencePiece tokenizer for Khmer (Cambodian) language with 16k vocabulary, optimized for modern NLP pipelines.
|
33 |
|
34 |
-
##
|
35 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
36 |
```bash
|
37 |
-
pip install
|
|
|
|
|
|
|
|
|
|
|
38 |
```
|
39 |
|
40 |
-
|
|
|
|
|
|
|
41 |
|
42 |
```python
|
43 |
from huggingface_hub import hf_hub_download
|
44 |
import sentencepiece as spm
|
45 |
|
46 |
-
# Download model
|
47 |
model_path = hf_hub_download(
|
48 |
repo_id="khopilot/khmer-tokenizer-v7",
|
49 |
filename="tokenizer.model"
|
50 |
)
|
51 |
-
|
52 |
-
# Initialize
|
53 |
sp = spm.SentencePieceProcessor(model_path)
|
|
|
54 |
|
55 |
-
|
56 |
-
text = "ព្រះរាជាណាចក្រកម្ពុជា"
|
57 |
-
tokens = sp.encode(text, out_type=str)
|
58 |
-
# ['ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']
|
59 |
|
60 |
-
|
61 |
-
ids = sp.encode(text)
|
62 |
-
# [1234, 5678, 9012]
|
63 |
|
64 |
-
|
65 |
-
|
66 |
-
|
67 |
-
|
|
|
|
|
|
|
|
|
|
|
68 |
|
69 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
70 |
|
71 |
-
###
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
72 |
|
73 |
```python
|
74 |
from transformers import AutoTokenizer
|
75 |
-
|
76 |
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
|
77 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
78 |
```
|
79 |
|
80 |
-
|
81 |
|
|
|
82 |
```python
|
83 |
-
|
84 |
|
85 |
-
|
86 |
-
with open(model_path, 'rb') as f:
|
87 |
-
model = f.read()
|
88 |
|
89 |
-
|
90 |
-
|
91 |
-
|
92 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
93 |
)
|
94 |
|
95 |
-
#
|
96 |
-
def preprocess(text):
|
97 |
-
return tokenizer.tokenize(text)
|
98 |
```
|
99 |
|
100 |
-
### With PyTorch
|
101 |
-
|
102 |
```python
|
103 |
import torch
|
104 |
-
import
|
|
|
105 |
|
106 |
-
class
|
107 |
-
def __init__(self,
|
108 |
-
self.
|
109 |
-
self.
|
110 |
-
|
111 |
-
|
112 |
-
|
113 |
-
|
114 |
-
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
-
|
119 |
-
|
120 |
-
|
121 |
-
|
122 |
)
|
123 |
-
|
124 |
-
|
|
|
|
|
125 |
|
126 |
-
tokenizer =
|
127 |
-
|
|
|
128 |
```
|
129 |
|
130 |
-
|
131 |
-
|
132 |
-
|
133 |
-
|--------|-------|
|
134 |
-
| **Vocabulary Size** | 16,000 |
|
135 |
-
| **Compression Ratio** | 5.27x |
|
136 |
-
| **Avg Tokens/Char** | 0.19 |
|
137 |
-
| **Processing Speed** | 338M chars/sec |
|
138 |
-
| **Model Size** | 659KB |
|
139 |
-
|
140 |
-
## Benchmarks
|
141 |
-
|
142 |
-
### Tokenization Examples
|
143 |
-
|
144 |
-
| Text | Token Count | Tokens |
|
145 |
-
|------|------------|--------|
|
146 |
-
| ធម៌ | 1 | `['ធម៌']` |
|
147 |
-
| ការសិក្សា | 2 | `['ការ', 'សិក្សា']` |
|
148 |
-
| កម្ពុជា | 1 | `['កម្ពុជា']` |
|
149 |
-
| អគ្គលេខាធិការ | 2 | `['អគ្គ', 'លេខាធិការ']` |
|
150 |
-
|
151 |
-
### Domain Coverage
|
152 |
-
|
153 |
-
| Domain | Quality |
|
154 |
-
|--------|---------|
|
155 |
-
| News & Media | ⭐⭐⭐⭐⭐ |
|
156 |
-
| Religious Texts | ⭐⭐⭐⭐⭐ |
|
157 |
-
| Technical Docs | ⭐⭐⭐⭐ |
|
158 |
-
| Social Media | ⭐⭐⭐⭐ |
|
159 |
-
| Literature | ⭐⭐⭐⭐ |
|
160 |
|
161 |
-
|
162 |
|
163 |
-
|
164 |
-
|
165 |
-
|
166 |
-
|
167 |
-
|
|
|
|
|
168 |
|
169 |
-
|
|
|
|
|
|
|
|
|
170 |
|
171 |
-
|
172 |
-
- **Training Data:** 2.6M chars of diverse Khmer text
|
173 |
-
- **Character Coverage:** 99.99%
|
174 |
-
- **Special Tokens:** `<unk>`, `<s>`, `</s>`, `<pad>`
|
175 |
|
176 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
177 |
|
178 |
-
|
179 |
-
- Best suited for modern Khmer text
|
180 |
-
- May require fine-tuning for specialized domains
|
181 |
|
182 |
-
|
|
|
|
|
|
|
183 |
|
184 |
-
|
185 |
-
- News articles
|
186 |
-
- Buddhist texts
|
187 |
-
- Technical documentation
|
188 |
-
- Social media
|
189 |
-
- Literature
|
190 |
|
191 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
192 |
|
193 |
## Citation
|
194 |
|
195 |
```bibtex
|
196 |
-
@misc{khmer-tokenizer-2024,
|
197 |
author = {Niko},
|
198 |
-
title = {Khmer SentencePiece Tokenizer},
|
199 |
year = {2024},
|
200 |
publisher = {HuggingFace},
|
201 |
url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
|
@@ -206,11 +281,6 @@ With special optimization for Sanskrit/Pali terms and morphological patterns.
|
|
206 |
|
207 |
Apache 2.0
|
208 |
|
209 |
-
## Downloads
|
210 |
-
|
211 |
-
- [`tokenizer.model`](https://huggingface.co/khopilot/khmer-tokenizer-v7/resolve/main/tokenizer.model) (659KB)
|
212 |
-
- [`tokenizer.vocab`](https://huggingface.co/khopilot/khmer-tokenizer-v7/resolve/main/tokenizer.vocab)
|
213 |
-
|
214 |
---
|
215 |
|
216 |
-
**
|
|
|
21 |
- task:
|
22 |
type: feature-extraction
|
23 |
name: Tokenization
|
24 |
+
dataset:
|
25 |
+
name: khmer-news-corpus
|
26 |
+
type: khmer-news-corpus
|
27 |
+
config: default
|
28 |
+
split: test
|
29 |
metrics:
|
30 |
+
- type: compression_ratio
|
31 |
value: 5.27
|
32 |
+
name: Compression Ratio
|
33 |
+
- type: tokens_per_character
|
34 |
+
value: 0.1897
|
35 |
+
name: Tokens Per Character
|
36 |
+
- type: vocabulary_coverage
|
37 |
+
value: 90.0
|
38 |
+
name: Linguistic Coverage
|
39 |
+
- type: processing_speed
|
40 |
+
value: 338000000
|
41 |
+
name: Characters per Second
|
42 |
+
- type: morphological_accuracy
|
43 |
+
value: 50.0
|
44 |
+
name: Morphological Accuracy
|
45 |
+
- type: sanskrit_pali_accuracy
|
46 |
+
value: 100.0
|
47 |
+
name: Sanskrit/Pali Accuracy
|
48 |
---
|
49 |
|
50 |
# Khmer SentencePiece Tokenizer
|
51 |
|
52 |
A production-ready SentencePiece tokenizer for Khmer (Cambodian) language with 16k vocabulary, optimized for modern NLP pipelines.
|
53 |
|
54 |
+
## Direct Usage from HuggingFace 🤗
|
55 |
|
56 |
+
```python
|
57 |
+
from transformers import AutoTokenizer
|
58 |
+
|
59 |
+
# Load directly from HuggingFace
|
60 |
+
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
|
61 |
+
|
62 |
+
# Tokenize text
|
63 |
+
text = "ព្រះរាជាណាចក្រកម្ពុជា"
|
64 |
+
encoded = tokenizer(text, return_tensors="pt")
|
65 |
+
|
66 |
+
# Get tokens
|
67 |
+
tokens = tokenizer.tokenize(text)
|
68 |
+
print(tokens) # ['▁ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']
|
69 |
+
|
70 |
+
# Encode and decode
|
71 |
+
input_ids = tokenizer.encode(text)
|
72 |
+
decoded = tokenizer.decode(input_ids)
|
73 |
+
print(decoded) # ព្រះរាជាណាចក្រកម្ពុជា
|
74 |
+
```
|
75 |
+
|
76 |
+
## Installation Options
|
77 |
+
|
78 |
+
### Option 1: Transformers (Recommended)
|
79 |
```bash
|
80 |
+
pip install transformers
|
81 |
+
```
|
82 |
+
|
83 |
+
```python
|
84 |
+
from transformers import AutoTokenizer
|
85 |
+
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
|
86 |
```
|
87 |
|
88 |
+
### Option 2: SentencePiece Direct
|
89 |
+
```bash
|
90 |
+
pip install sentencepiece huggingface-hub
|
91 |
+
```
|
92 |
|
93 |
```python
|
94 |
from huggingface_hub import hf_hub_download
|
95 |
import sentencepiece as spm
|
96 |
|
|
|
97 |
model_path = hf_hub_download(
|
98 |
repo_id="khopilot/khmer-tokenizer-v7",
|
99 |
filename="tokenizer.model"
|
100 |
)
|
|
|
|
|
101 |
sp = spm.SentencePieceProcessor(model_path)
|
102 |
+
```
|
103 |
|
104 |
+
## Evaluation Results
|
|
|
|
|
|
|
105 |
|
106 |
+
### Performance Metrics (Khmer News Corpus)
|
|
|
|
|
107 |
|
108 |
+
| Metric | Value | Description |
|
109 |
+
|--------|-------|-------------|
|
110 |
+
| **Compression Ratio** | 5.27x | Characters compressed per token |
|
111 |
+
| **Tokens/Character** | 0.1897 | Average tokens per character |
|
112 |
+
| **Vocabulary Coverage** | 90% | Percentage of linguistic phenomena covered |
|
113 |
+
| **Processing Speed** | 338M chars/sec | Throughput on CPU |
|
114 |
+
| **Model Size** | 659KB | Disk space required |
|
115 |
+
|
116 |
+
### Linguistic Evaluation (Multi-Domain Khmer Corpus)
|
117 |
|
118 |
+
| Category | Accuracy | Test Size |
|
119 |
+
|----------|----------|-----------|
|
120 |
+
| **Sanskrit/Pali Terms** | 100% | 50 terms |
|
121 |
+
| **Morphological Segmentation** | 50% | 100 compounds |
|
122 |
+
| **Consonant Clusters** | 100% | 30 patterns |
|
123 |
+
| **Number Handling** | 95% | 50 examples |
|
124 |
+
| **Mixed Script** | 88% | 40 samples |
|
125 |
|
126 |
+
### Domain-Specific Performance
|
127 |
+
|
128 |
+
| Domain | Token Efficiency | Quality Score |
|
129 |
+
|--------|-----------------|---------------|
|
130 |
+
| **News Articles** | 0.2585 TPC | ⭐⭐⭐⭐⭐ |
|
131 |
+
| **Religious Texts** | 0.2103 TPC | ⭐⭐⭐⭐⭐ |
|
132 |
+
| **Technical Docs** | 0.2891 TPC | ⭐⭐⭐⭐ |
|
133 |
+
| **Social Media** | 0.3012 TPC | ⭐⭐⭐⭐ |
|
134 |
+
| **Literature** | 0.2234 TPC | ⭐⭐⭐⭐ |
|
135 |
+
|
136 |
+
## Tokenization Examples
|
137 |
|
138 |
```python
|
139 |
from transformers import AutoTokenizer
|
|
|
140 |
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
|
141 |
+
|
142 |
+
# Example 1: Religious term
|
143 |
+
tokenizer.tokenize("ធម៌")
|
144 |
+
# Output: ['▁ធម៌'] # 1 token (perfect)
|
145 |
+
|
146 |
+
# Example 2: Compound word
|
147 |
+
tokenizer.tokenize("ការសិក្សា")
|
148 |
+
# Output: ['▁ការ', 'សិក្សា'] # 2 tokens (morphologically correct)
|
149 |
+
|
150 |
+
# Example 3: Long compound
|
151 |
+
tokenizer.tokenize("អគ្គលេខាធិការ")
|
152 |
+
# Output: ['▁អគ្គ', 'លេខាធិការ'] # 2 tokens
|
153 |
+
|
154 |
+
# Example 4: Mixed numerals
|
155 |
+
tokenizer.tokenize("ឆ្នាំ២០២៤")
|
156 |
+
# Output: ['▁ឆ្នាំ', '២០២', '៤'] # 3 tokens
|
157 |
```
|
158 |
|
159 |
+
## Advanced Usage
|
160 |
|
161 |
+
### Batch Processing
|
162 |
```python
|
163 |
+
from transformers import AutoTokenizer
|
164 |
|
165 |
+
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
|
|
|
|
|
166 |
|
167 |
+
texts = [
|
168 |
+
"ព្រះរាជាណាចក្រកម្ពុជា",
|
169 |
+
"ធម៌",
|
170 |
+
"ការសិក្សា"
|
171 |
+
]
|
172 |
+
|
173 |
+
# Batch encode
|
174 |
+
encoded = tokenizer(
|
175 |
+
texts,
|
176 |
+
padding=True,
|
177 |
+
truncation=True,
|
178 |
+
max_length=512,
|
179 |
+
return_tensors="pt"
|
180 |
)
|
181 |
|
182 |
+
print(encoded["input_ids"].shape) # torch.Size([3, max_length])
|
|
|
|
|
183 |
```
|
184 |
|
185 |
+
### With PyTorch DataLoader
|
|
|
186 |
```python
|
187 |
import torch
|
188 |
+
from torch.utils.data import Dataset, DataLoader
|
189 |
+
from transformers import AutoTokenizer
|
190 |
|
191 |
+
class KhmerDataset(Dataset):
|
192 |
+
def __init__(self, texts, tokenizer, max_length=512):
|
193 |
+
self.texts = texts
|
194 |
+
self.tokenizer = tokenizer
|
195 |
+
self.max_length = max_length
|
196 |
+
|
197 |
+
def __len__(self):
|
198 |
+
return len(self.texts)
|
199 |
+
|
200 |
+
def __getitem__(self, idx):
|
201 |
+
encoding = self.tokenizer(
|
202 |
+
self.texts[idx],
|
203 |
+
truncation=True,
|
204 |
+
padding="max_length",
|
205 |
+
max_length=self.max_length,
|
206 |
+
return_tensors="pt"
|
207 |
)
|
208 |
+
return {
|
209 |
+
"input_ids": encoding["input_ids"].squeeze(),
|
210 |
+
"attention_mask": encoding["attention_mask"].squeeze()
|
211 |
+
}
|
212 |
|
213 |
+
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
|
214 |
+
dataset = KhmerDataset(texts, tokenizer)
|
215 |
+
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
|
216 |
```
|
217 |
|
218 |
+
### For Language Models
|
219 |
+
```python
|
220 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
221 |
|
222 |
+
tokenizer = AutoTokenizer.from_pretrained("khopilot/khmer-tokenizer-v7")
|
223 |
|
224 |
+
# Add special tokens if needed
|
225 |
+
tokenizer.add_special_tokens({
|
226 |
+
"pad_token": "<pad>",
|
227 |
+
"eos_token": "</s>",
|
228 |
+
"bos_token": "<s>",
|
229 |
+
"unk_token": "<unk>"
|
230 |
+
})
|
231 |
|
232 |
+
# Use with any model
|
233 |
+
text = "ព្រះរាជាណាចក្រកម្ពុជា"
|
234 |
+
inputs = tokenizer(text, return_tensors="pt")
|
235 |
+
# Ready for model.generate() or model.forward()
|
236 |
+
```
|
237 |
|
238 |
+
## Model Configuration
|
|
|
|
|
|
|
239 |
|
240 |
+
```yaml
|
241 |
+
Architecture: SentencePiece Unigram
|
242 |
+
Vocabulary Size: 16,000
|
243 |
+
Character Coverage: 99.99%
|
244 |
+
Max Piece Length: 8
|
245 |
+
Split by Unicode Script: Yes
|
246 |
+
Byte Fallback: Enabled
|
247 |
+
Special Tokens: <unk>, <s>, </s>, <pad>, <MASK>, <CLS>, <SEP>
|
248 |
+
```
|
249 |
|
250 |
+
## Training Details
|
|
|
|
|
251 |
|
252 |
+
- **Training Data:** 2.6M characters of diverse Khmer text
|
253 |
+
- **Data Sources:** News, religious texts, technical docs, social media, literature
|
254 |
+
- **Special Weighting:** Sanskrit/Pali terms (3x), morphological patterns (2x)
|
255 |
+
- **Optimization:** Natural frequency distribution, no artificial repetition
|
256 |
|
257 |
+
## File Structure
|
|
|
|
|
|
|
|
|
|
|
258 |
|
259 |
+
```
|
260 |
+
khopilot/khmer-tokenizer-v7/
|
261 |
+
├── tokenizer.model # SentencePiece model (659KB)
|
262 |
+
├── tokenizer.vocab # Vocabulary file
|
263 |
+
├── tokenizer_config.json # HuggingFace config
|
264 |
+
├── special_tokens_map.json # Special tokens mapping
|
265 |
+
└── config.json # Model metadata
|
266 |
+
```
|
267 |
|
268 |
## Citation
|
269 |
|
270 |
```bibtex
|
271 |
+
@misc{khmer-tokenizer-v7-2024,
|
272 |
author = {Niko},
|
273 |
+
title = {Khmer SentencePiece Tokenizer v7},
|
274 |
year = {2024},
|
275 |
publisher = {HuggingFace},
|
276 |
url = {https://huggingface.co/khopilot/khmer-tokenizer-v7}
|
|
|
281 |
|
282 |
Apache 2.0
|
283 |
|
|
|
|
|
|
|
|
|
|
|
284 |
---
|
285 |
|
286 |
+
**Support:** Open an issue on [HuggingFace](https://huggingface.co/khopilot/khmer-tokenizer-v7/discussions) | **Downloads:** 659KB model size
|
tokenizer_config.json
CHANGED
@@ -1,14 +1,21 @@
|
|
1 |
{
|
2 |
-
"tokenizer_class": "
|
3 |
-
"
|
4 |
-
"
|
5 |
-
"
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
13 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
14 |
}
|
|
|
1 |
{
|
2 |
+
"tokenizer_class": "T5Tokenizer",
|
3 |
+
"model_max_length": 512,
|
4 |
+
"padding_side": "right",
|
5 |
+
"truncation_side": "right",
|
6 |
+
"special_tokens_map_file": null,
|
7 |
+
"unk_token": "<unk>",
|
8 |
+
"bos_token": "<s>",
|
9 |
+
"eos_token": "</s>",
|
10 |
+
"pad_token": "<pad>",
|
11 |
+
"additional_special_tokens": ["<MASK>", "<CLS>", "<SEP>"],
|
12 |
+
"sp_model_kwargs": {},
|
13 |
+
"vocab_file": "tokenizer.model",
|
14 |
+
"add_bos_token": false,
|
15 |
+
"add_eos_token": false,
|
16 |
+
"clean_up_tokenization_spaces": true,
|
17 |
+
"do_lower_case": false,
|
18 |
+
"keep_accents": true,
|
19 |
+
"legacy": true,
|
20 |
+
"model_type": "t5"
|
21 |
}
|