Upload folder using huggingface_hub
- README.md +176 -0
- config.json +20 -0
- special_tokens_map.json +11 -0
- tokenizer.model +3 -0
- tokenizer.vocab +0 -0
- tokenizer_config.json +14 -0
README.md
ADDED
@@ -0,0 +1,176 @@
# Khmer Tokenizer V7 - Advanced SentencePiece Model

## Model Details

### Model Description
An advanced Khmer tokenizer trained with the SentencePiece Unigram algorithm, optimized for superior Sanskrit/Pali handling and morphological awareness.

- **Developed by:** Niko (Freelance Full-Stack Developer)
- **Model type:** SentencePiece Unigram Tokenizer
- **Language:** Khmer (km)
- **License:** Apache 2.0
- **Model version:** V7.0
- **Vocabulary size:** 16,000 tokens

### Model Sources
- **Repository:** [GitHub - khmer-tokenizer-v7](https://github.com/yourusername/khmer-tokenizer-v7)
- **Demo:** Available in repository

## Performance Metrics

### PhD-Level Evaluation Score: 78.0/100

| Metric | Score | Grade | Details |
|--------|-------|-------|---------|
| Statistical | 90/100 | A | TPC: 0.2206 (excellent compression) |
| Linguistic | 70/100 | B | 61.8% coverage of Khmer phenomena |
| Information Theory | 70/100 | B | 51.3% compression efficiency |
| Morphological | 70/100 | B | 50% accuracy (vs 0% in V6.5) |
| Performance | 90/100 | A | 14.6M chars/sec throughput |

### Key Improvements Over V6.5

| Metric | V6.5 | V7 | Improvement |
|--------|------|-----|-------------|
| Tokens Per Character | 0.45 | 0.22 | **51% better** |
| Sanskrit/Pali (ធម៌) | 5 tokens | 1 token | **80% reduction** |
| Morphological Accuracy | 0% | 50% | **+50 points** |
| Vocabulary Utilization | 0.66% | 1.14% | **73% increase** |

## Uses

### Direct Use
- Text preprocessing for Khmer NLP tasks
- Machine translation systems
- Text generation models
- Information retrieval
- Text classification

### Downstream Use
- Fine-tuning language models for Khmer
- Building Khmer chatbots and assistants
- Document processing pipelines
- OCR post-processing

### Out-of-Scope Use
- Languages other than Khmer
- Real-time speech processing (the tokenizer is optimized for written text)
- Character-level tasks

## Bias, Risks, and Limitations

### Technical Limitations
- 50% morphological accuracy (room for improvement)
- Deviation from Zipf's law (α = 0.505 vs. the expected 0.9-1.2; see the estimation sketch below)
- Some vowel combinations still split suboptimally
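For context, the Zipf exponent α cited above is commonly estimated by fitting the slope of log token frequency against log rank over a tokenized corpus. The following is a rough, self-contained sketch; the corpus file name is a placeholder, and it may not reproduce the exact evaluation procedure used here:

```python
import math
from collections import Counter

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="khmer_v7.model")

# Placeholder corpus; substitute any representative Khmer text file.
with open("khmer_eval.txt", encoding="utf-8") as f:
    tokens = sp.encode(f.read(), out_type=str)

# Token frequencies sorted from most to least common.
freqs = sorted(Counter(tokens).values(), reverse=True)
xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
ys = [math.log(freq) for freq in freqs]

# Least-squares slope of log-frequency vs. log-rank; alpha is the negative slope.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
print(f"estimated Zipf exponent alpha ≈ {-slope:.3f}")
```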

### Recommendations
- Validate on domain-specific text before production use
- Consider ensemble approaches for critical applications
- Monitor performance on out-of-domain text

## How to Get Started

```python
import sentencepiece as spm

# Load the model
sp = spm.SentencePieceProcessor(model_file='khmer_v7.model')

# Tokenize text
text = "ព្រះរាជាណាចក្រកម្ពុជា"
tokens = sp.encode(text, out_type=str)
print(tokens)  # ['ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']

# Decode tokens
token_ids = sp.encode(text)
decoded = sp.decode(token_ids)
print(decoded)  # ព្រះរាជាណាចក្រកម្ពុជា
```
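If the model file is fetched from this Hub repository rather than referenced by a local path, it can first be downloaded with `huggingface_hub` (a sketch; the repo id below is a placeholder, and both `sentencepiece` and `huggingface_hub` must be installed):

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Placeholder repo id; replace with the actual Hub repository name.
model_path = hf_hub_download(repo_id="your-username/khmer-tokenizer-v7",
                             filename="tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.get_piece_size())  # expected: 16000
```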

## Training Details

### Training Data
- **Source:** Combined Khmer corpus (9.3 MB)
- **Size:** 2.6M characters of unique, natural Khmer text
- **Composition:**
  - News articles
  - Religious/Buddhist texts
  - Technical documentation
  - Literary works
  - Colloquial text
  - Sanskrit/Pali terms (3x weighted)
  - Morphological patterns (2x weighted)

### Training Procedure

#### Preprocessing
1. NFC normalization
2. Duplicate removal
3. Sanskrit/Pali term injection (3x weight)
4. Morphological boundary hints (2x weight)
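A minimal sketch of how steps 1-4 might be implemented; the function and its inputs are illustrative, not the actual training scripts:

```python
import unicodedata

def build_weighted_corpus(lines, sanskrit_pali_terms, morph_patterns):
    """Normalize, deduplicate, and up-weight special line sets (illustrative only)."""
    # 1. NFC normalization
    normalized = [unicodedata.normalize("NFC", line.strip()) for line in lines]
    # 2. Duplicate removal (order-preserving)
    seen, unique = set(), []
    for line in normalized:
        if line and line not in seen:
            seen.add(line)
            unique.append(line)
    # 3-4. Inject Sanskrit/Pali terms at 3x weight and morphological patterns at 2x
    return unique + 3 * list(sanskrit_pali_terms) + 2 * list(morph_patterns)
```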

#### Training Hyperparameters
- **Model type:** Unigram
- **Vocabulary size:** 16,000
- **Character coverage:** 0.9999
- **Max piece length:** 8
- **Split by unicode script:** True
- **Treat whitespace as suffix:** True
- **Byte fallback:** True
- **Threads:** 16
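These settings correspond directly to SentencePiece trainer options; a sketch of a training call consistent with the values above (the input corpus path is a placeholder, since the weighted training file is not distributed in this repo):

```python
import sentencepiece as spm

# Placeholder corpus path; see the preprocessing sketch above for how it might be built.
spm.SentencePieceTrainer.train(
    input="khmer_corpus_weighted.txt",
    model_prefix="khmer_v7",
    model_type="unigram",
    vocab_size=16000,
    character_coverage=0.9999,
    max_sentencepiece_length=8,
    split_by_unicode_script=True,
    treat_whitespace_as_suffix=True,
    byte_fallback=True,
    num_threads=16,
)
```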

### Training Infrastructure
- **Platform:** macOS (Darwin 24.4.0)
- **Software:** SentencePiece 0.1.99

## Evaluation

### Testing Data
Six categories of Khmer text:
1. News articles
2. Buddhist/religious texts
3. Technical documentation
4. Literary/formal text
5. Colloquial/social media
6. Mixed numerals and dates

### Metrics

#### Compression Efficiency
- **Tokens Per Character (TPC):** 0.2206
- **Standard Deviation:** 0.0483
- **95% CI:** [0.1622, 0.3017]
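TPC is simply the number of tokens produced divided by the number of characters in the input. A sketch of how the mean and standard deviation across the six test categories could be recomputed (the per-category file names are placeholders, as the evaluation set is not bundled here):

```python
import statistics
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="khmer_v7.model")

# Placeholder per-category test files matching the six categories above.
test_files = ["news.txt", "religious.txt", "technical.txt",
              "literary.txt", "colloquial.txt", "numerals.txt"]

tpc_values = []
for path in test_files:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = sp.encode(text, out_type=str)
    tpc_values.append(len(tokens) / len(text))  # tokens per character

print(f"mean TPC = {statistics.mean(tpc_values):.4f}, stdev = {statistics.stdev(tpc_values):.4f}")
```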

#### Linguistic Coverage
- **Consonant Clusters:** 100% optimal
- **Sanskrit/Pali Loans:** 100% optimal
- **Vowel Combinations:** 25% optimal
- **Diacritics:** 50% optimal
- **Overall:** 61.8%

#### Special Features
- ✅ **Sanskrit/Pali Excellence:** ធម៌ → 1 token (was 5 tokens)
- ✅ **Morphological Awareness:** ការសិក្សា → ['ការ', 'សិក្សា']
- ✅ **Production Speed:** 14.6M chars/sec

## Environmental Impact
Minimal: training completed in minutes on standard hardware.

## Citation

```bibtex
@software{khmer_tokenizer_v7_2024,
  author = {Niko},
  title = {Khmer Tokenizer V7 - Advanced SentencePiece Model},
  year = {2024},
  version = {7.0},
  url = {https://github.com/yourusername/khmer-tokenizer-v7}
}
```

## Model Card Contact
For questions or feedback, please open an issue on GitHub.

---
*Generated based on PhD-level linguistic analysis and evaluation*
config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "khmer_tokenizer_v7",
  "tokenizer_type": "sentencepiece_unigram",
  "vocab_size": 16000,
  "language": "km",
  "version": "7.0",
  "metrics": {
    "phd_score": 78.0,
    "tpc": 0.2206,
    "morphological_accuracy": 0.5,
    "linguistic_coverage": 0.618,
    "sanskrit_pali_optimal": true
  },
  "training": {
    "corpus_size": "2.6M chars",
    "character_coverage": 0.9999,
    "max_piece_length": 8,
    "byte_fallback": true
  }
}
special_tokens_map.json
ADDED
@@ -0,0 +1,11 @@
{
  "unk_token": "<unk>",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "additional_special_tokens": [
    "<MASK>",
    "<CLS>",
    "<SEP>"
  ]
}
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2b0d40784b70c03f553de0f736513fd169c8fd825b40640117f502956b452a69
size 659464
tokenizer.vocab
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,14 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_type": "sentencepiece",
  "vocab_file": "khmer_v7.model",
  "special_tokens": {
    "unk_token": "<unk>",
    "bos_token": "<s>",
    "eos_token": "</s>",
    "pad_token": "<pad>",
    "mask_token": "<MASK>",
    "cls_token": "<CLS>",
    "sep_token": "<SEP>"
  }
}
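Note that this file appears to use a custom layout rather than the stock `transformers` tokenizer_config format, so it is safest to read it as documentation of the intended special tokens. A minimal sketch that loads the SentencePiece model together with `special_tokens_map.json` and checks how each special token resolves (assumes the files from this repo are in the working directory):

```python
import json
import sentencepiece as spm

# Assumes tokenizer.model and special_tokens_map.json from this repo are local.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
with open("special_tokens_map.json", encoding="utf-8") as f:
    special = json.load(f)

core = [special["unk_token"], special["bos_token"],
        special["eos_token"], special["pad_token"]]
for token in core + special["additional_special_tokens"]:
    piece_id = sp.piece_to_id(token)
    # piece_to_id falls back to the <unk> id for pieces missing from the vocabulary,
    # so this only shows whether each special token maps to a known piece.
    print(token, piece_id, sp.id_to_piece(piece_id))
```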