khopilot commited on
Commit
20bafc5
·
verified ·
1 Parent(s): fb45150

Upload folder using huggingface_hub

Browse files
Files changed (6) hide show
  1. README.md +176 -0
  2. config.json +20 -0
  3. special_tokens_map.json +11 -0
  4. tokenizer.model +3 -0
  5. tokenizer.vocab +0 -0
  6. tokenizer_config.json +14 -0
README.md ADDED
@@ -0,0 +1,176 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Khmer Tokenizer V7 - Advanced SentencePiece Model
2
+
3
+ ## Model Details
4
+
5
+ ### Model Description
6
+ Advanced Khmer tokenizer trained with SentencePiece Unigram algorithm, optimized for superior Sanskrit/Pali handling and morphological awareness.
7
+
8
+ - **Developed by:** Niko (Freelance Full-Stack Developer)
9
+ - **Model type:** SentencePiece Unigram Tokenizer
10
+ - **Language:** Khmer (km)
11
+ - **License:** Apache 2.0
12
+ - **Model version:** V7.0
13
+ - **Vocabulary size:** 16,000 tokens
14
+
15
+ ### Model Sources
16
+ - **Repository:** [GitHub - khmer-tokenizer-v7](https://github.com/yourusername/khmer-tokenizer-v7)
17
+ - **Demo:** Available in repository
18
+
19
+ ## Performance Metrics
20
+
21
+ ### PhD-Level Evaluation Score: 78.0/100
22
+
23
+ | Metric | Score | Grade | Details |
24
+ |--------|-------|-------|---------|
25
+ | Statistical | 90/100 | A | TPC: 0.2206 (excellent compression) |
26
+ | Linguistic | 70/100 | B | 61.8% coverage of Khmer phenomena |
27
+ | Information Theory | 70/100 | B | 51.3% compression efficiency |
28
+ | Morphological | 70/100 | B | 50% accuracy (vs 0% in V6.5) |
29
+ | Performance | 90/100 | A | 14.6M chars/sec throughput |
30
+
31
+ ### Key Improvements Over V6.5
32
+
33
+ | Metric | V6.5 | V7 | Improvement |
34
+ |--------|------|-----|-------------|
35
+ | Tokens Per Character | 0.45 | 0.22 | **51% better** |
36
+ | Sanskrit/Pali (ធម៌) | 5 tokens | 1 token | **80% reduction** |
37
+ | Morphological Accuracy | 0% | 50% | **+50 points** |
38
+ | Vocabulary Utilization | 0.66% | 1.14% | **73% increase** |
39
+
40
+ ## Uses
41
+
42
+ ### Direct Use
43
+ - Text preprocessing for Khmer NLP tasks
44
+ - Machine translation systems
45
+ - Text generation models
46
+ - Information retrieval
47
+ - Text classification
48
+
49
+ ### Downstream Use
50
+ - Fine-tuning language models for Khmer
51
+ - Building Khmer chatbots and assistants
52
+ - Document processing pipelines
53
+ - OCR post-processing
54
+
55
+ ### Out-of-Scope Use
56
+ - Languages other than Khmer
57
+ - Real-time speech processing (optimized for text)
58
+ - Character-level tasks
59
+
60
+ ## Bias, Risks, and Limitations
61
+
62
+ ### Technical Limitations
63
+ - 50% morphological accuracy (room for improvement)
64
+ - Deviation from Zipf's law (α=0.505 vs expected 0.9-1.2)
65
+ - Some vowel combinations still split suboptimally
66
+
67
+ ### Recommendations
68
+ - Validate on domain-specific text before production use
69
+ - Consider ensemble approaches for critical applications
70
+ - Monitor performance on out-of-domain text
71
+
72
+ ## How to Get Started
73
+
74
+ ```python
75
+ import sentencepiece as spm
76
+
77
+ # Load the model
78
+ sp = spm.SentencePieceProcessor(model_file='khmer_v7.model')
79
+
80
+ # Tokenize text
81
+ text = "ព្រះរាជាណាចក្រកម្ពុជា"
82
+ tokens = sp.encode(text, out_type=str)
83
+ print(tokens) # ['ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']
84
+
85
+ # Decode tokens
86
+ token_ids = sp.encode(text)
87
+ decoded = sp.decode(token_ids)
88
+ print(decoded) # ព្រះរាជាណាចក្រកម្ពុជា
89
+ ```
90
+
91
+ ## Training Details
92
+
93
+ ### Training Data
94
+ - **Source:** Combined Khmer corpus (9.3MB)
95
+ - **Size:** 2.6M characters of unique, natural Khmer text
96
+ - **Composition:**
97
+ - News articles
98
+ - Religious/Buddhist texts
99
+ - Technical documentation
100
+ - Literary works
101
+ - Colloquial text
102
+ - Sanskrit/Pali terms (3x weighted)
103
+ - Morphological patterns (2x weighted)
104
+
105
+ ### Training Procedure
106
+
107
+ #### Preprocessing
108
+ 1. NFC normalization
109
+ 2. Duplicate removal
110
+ 3. Sanskrit/Pali term injection (3x weight)
111
+ 4. Morphological boundary hints (2x weight)
112
+
113
+ #### Training Hyperparameters
114
+ - **Model type:** Unigram
115
+ - **Vocabulary size:** 16,000
116
+ - **Character coverage:** 0.9999
117
+ - **Max piece length:** 8
118
+ - **Split by unicode script:** True
119
+ - **Treat whitespace as suffix:** True
120
+ - **Byte fallback:** True
121
+ - **Threads:** 16
122
+
123
+ ### Training Infrastructure
124
+ - **Hardware:** MacOS (Darwin 24.4.0)
125
+ - **Software:** SentencePiece 0.1.99
126
+
127
+ ## Evaluation
128
+
129
+ ### Testing Data
130
+ Six categories of Khmer text:
131
+ 1. News articles
132
+ 2. Buddhist/religious texts
133
+ 3. Technical documentation
134
+ 4. Literary/formal text
135
+ 5. Colloquial/social media
136
+ 6. Mixed numerals and dates
137
+
138
+ ### Metrics
139
+
140
+ #### Compression Efficiency
141
+ - **Tokens Per Character (TPC):** 0.2206
142
+ - **Standard Deviation:** 0.0483
143
+ - **95% CI:** [0.1622, 0.3017]
144
+
145
+ #### Linguistic Coverage
146
+ - **Consonant Clusters:** 100% optimal
147
+ - **Sanskrit/Pali Loans:** 100% optimal
148
+ - **Vowel Combinations:** 25% optimal
149
+ - **Diacritics:** 50% optimal
150
+ - **Overall:** 61.8%
151
+
152
+ #### Special Features
153
+ ✅ **Sanskrit/Pali Excellence:** ធម៌ → 1 token (was 5 tokens)
154
+ ✅ **Morphological Awareness:** ការសិក្សា → ['ការ', 'សិក្សា']
155
+ ✅ **Production Speed:** 14.6M chars/sec
156
+
157
+ ## Environmental Impact
158
+ Minimal - training completed in minutes on standard hardware.
159
+
160
+ ## Citation
161
+
162
+ ```bibtex
163
+ @software{khmer_tokenizer_v7_2024,
164
+ author = {Niko},
165
+ title = {Khmer Tokenizer V7 - Advanced SentencePiece Model},
166
+ year = {2024},
167
+ version = {7.0},
168
+ url = {https://github.com/yourusername/khmer-tokenizer-v7}
169
+ }
170
+ ```
171
+
172
+ ## Model Card Contact
173
+ For questions or feedback, please open an issue on GitHub.
174
+
175
+ ---
176
+ *Generated based on PhD-level linguistic analysis and evaluation*
config.json ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_type": "khmer_tokenizer_v7",
3
+ "tokenizer_type": "sentencepiece_unigram",
4
+ "vocab_size": 16000,
5
+ "language": "km",
6
+ "version": "7.0",
7
+ "metrics": {
8
+ "phd_score": 78.0,
9
+ "tpc": 0.2206,
10
+ "morphological_accuracy": 0.5,
11
+ "linguistic_coverage": 0.618,
12
+ "sanskrit_pali_optimal": true
13
+ },
14
+ "training": {
15
+ "corpus_size": "2.6M chars",
16
+ "character_coverage": 0.9999,
17
+ "max_piece_length": 8,
18
+ "byte_fallback": true
19
+ }
20
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "unk_token": "<unk>",
3
+ "bos_token": "<s>",
4
+ "eos_token": "</s>",
5
+ "pad_token": "<pad>",
6
+ "additional_special_tokens": [
7
+ "<MASK>",
8
+ "<CLS>",
9
+ "<SEP>"
10
+ ]
11
+ }
tokenizer.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2b0d40784b70c03f553de0f736513fd169c8fd825b40640117f502956b452a69
3
+ size 659464
tokenizer.vocab ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "tokenizer_class": "PreTrainedTokenizerFast",
3
+ "model_type": "sentencepiece",
4
+ "vocab_file": "khmer_v7.model",
5
+ "special_tokens": {
6
+ "unk_token": "<unk>",
7
+ "bos_token": "<s>",
8
+ "eos_token": "</s>",
9
+ "pad_token": "<pad>",
10
+ "mask_token": "<MASK>",
11
+ "cls_token": "<CLS>",
12
+ "sep_token": "<SEP>"
13
+ }
14
+ }