Upload folder using huggingface_hub
- README.md +176 -0
- config.json +20 -0
- special_tokens_map.json +11 -0
- tokenizer.model +3 -0
- tokenizer.vocab +0 -0
- tokenizer_config.json +14 -0
README.md
ADDED
@@ -0,0 +1,176 @@
# Khmer Tokenizer V7 - Advanced SentencePiece Model

## Model Details

### Model Description
An advanced Khmer tokenizer trained with the SentencePiece Unigram algorithm, optimized for superior Sanskrit/Pali handling and morphological awareness.

- **Developed by:** Niko (Freelance Full-Stack Developer)
- **Model type:** SentencePiece Unigram Tokenizer
- **Language:** Khmer (km)
- **License:** Apache 2.0
- **Model version:** V7.0
- **Vocabulary size:** 16,000 tokens

### Model Sources
- **Repository:** [GitHub - khmer-tokenizer-v7](https://github.com/yourusername/khmer-tokenizer-v7)
- **Demo:** Available in repository

## Performance Metrics

### PhD-Level Evaluation Score: 78.0/100

| Metric | Score | Grade | Details |
|--------|-------|-------|---------|
| Statistical | 90/100 | A | TPC: 0.2206 (excellent compression) |
| Linguistic | 70/100 | B | 61.8% coverage of Khmer phenomena |
| Information Theory | 70/100 | B | 51.3% compression efficiency |
| Morphological | 70/100 | B | 50% accuracy (vs 0% in V6.5) |
| Performance | 90/100 | A | 14.6M chars/sec throughput |

### Key Improvements Over V6.5

| Metric | V6.5 | V7 | Improvement |
|--------|------|-----|-------------|
| Tokens Per Character | 0.45 | 0.22 | **51% better** |
| Sanskrit/Pali (ធម៌) | 5 tokens | 1 token | **80% reduction** |
| Morphological Accuracy | 0% | 50% | **+50 points** |
| Vocabulary Utilization | 0.66% | 1.14% | **73% increase** |

## Uses

### Direct Use
- Text preprocessing for Khmer NLP tasks
- Machine translation systems
- Text generation models
- Information retrieval
- Text classification

### Downstream Use
- Fine-tuning language models for Khmer
- Building Khmer chatbots and assistants
- Document processing pipelines
- OCR post-processing

### Out-of-Scope Use
- Languages other than Khmer
- Real-time speech processing (the tokenizer is optimized for written text)
- Character-level tasks

## Bias, Risks, and Limitations

### Technical Limitations
- 50% morphological accuracy (room for improvement)
- Deviation from Zipf's law (α = 0.505 vs. the expected 0.9-1.2; see the estimation sketch below)
- Some vowel combinations still split suboptimally
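For context, the Zipf exponent α cited above is commonly estimated by fitting the slope of log token frequency against log rank over a tokenized corpus. The following is a rough, self-contained sketch; the corpus file name is a placeholder, and it may not reproduce the exact evaluation procedure used here:

```python
import math
from collections import Counter

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="khmer_v7.model")

# Placeholder corpus; substitute any representative Khmer text file.
with open("khmer_eval.txt", encoding="utf-8") as f:
    tokens = sp.encode(f.read(), out_type=str)

# Token frequencies sorted from most to least common.
freqs = sorted(Counter(tokens).values(), reverse=True)
xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
ys = [math.log(freq) for freq in freqs]

# Least-squares slope of log-frequency vs. log-rank; alpha is the negative slope.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
print(f"estimated Zipf exponent alpha ≈ {-slope:.3f}")
```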

### Recommendations
- Validate on domain-specific text before production use
- Consider ensemble approaches for critical applications
- Monitor performance on out-of-domain text

## How to Get Started

```python
import sentencepiece as spm

# Load the model
sp = spm.SentencePieceProcessor(model_file='khmer_v7.model')

# Tokenize text
text = "ព្រះរាជាណាចក្រកម្ពុជា"
tokens = sp.encode(text, out_type=str)
print(tokens)  # ['ព្រះរាជ', 'ាណាចក្រ', 'កម្ពុជា']

# Decode tokens
token_ids = sp.encode(text)
decoded = sp.decode(token_ids)
print(decoded)  # ព្រះរាជាណាចក្រកម្ពុជា
```
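If the model file is fetched from this Hub repository rather than referenced by a local path, it can first be downloaded with `huggingface_hub` (a sketch; the repo id below is a placeholder, and both `sentencepiece` and `huggingface_hub` must be installed):

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Placeholder repo id; replace with the actual Hub repository name.
model_path = hf_hub_download(repo_id="your-username/khmer-tokenizer-v7",
                             filename="tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.get_piece_size())  # expected: 16000
```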

## Training Details

### Training Data
- **Source:** Combined Khmer corpus (9.3 MB)
- **Size:** 2.6M characters of unique, natural Khmer text
- **Composition:**
  - News articles
  - Religious/Buddhist texts
  - Technical documentation
  - Literary works
  - Colloquial text
  - Sanskrit/Pali terms (3x weighted)
  - Morphological patterns (2x weighted)

### Training Procedure

#### Preprocessing
1. NFC normalization
2. Duplicate removal
3. Sanskrit/Pali term injection (3x weight)
4. Morphological boundary hints (2x weight)
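A minimal sketch of how steps 1-4 might be implemented; the function and its inputs are illustrative, not the actual training scripts:

```python
import unicodedata

def build_weighted_corpus(lines, sanskrit_pali_terms, morph_patterns):
    """Normalize, deduplicate, and up-weight special line sets (illustrative only)."""
    # 1. NFC normalization
    normalized = [unicodedata.normalize("NFC", line.strip()) for line in lines]
    # 2. Duplicate removal (order-preserving)
    seen, unique = set(), []
    for line in normalized:
        if line and line not in seen:
            seen.add(line)
            unique.append(line)
    # 3-4. Inject Sanskrit/Pali terms at 3x weight and morphological patterns at 2x
    return unique + 3 * list(sanskrit_pali_terms) + 2 * list(morph_patterns)
```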

#### Training Hyperparameters
- **Model type:** Unigram
- **Vocabulary size:** 16,000
- **Character coverage:** 0.9999
- **Max piece length:** 8
- **Split by unicode script:** True
- **Treat whitespace as suffix:** True
- **Byte fallback:** True
- **Threads:** 16
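These settings correspond directly to SentencePiece trainer options; a sketch of a training call consistent with the values above (the input corpus path is a placeholder, since the weighted training file is not distributed in this repo):

```python
import sentencepiece as spm

# Placeholder corpus path; see the preprocessing sketch above for how it might be built.
spm.SentencePieceTrainer.train(
    input="khmer_corpus_weighted.txt",
    model_prefix="khmer_v7",
    model_type="unigram",
    vocab_size=16000,
    character_coverage=0.9999,
    max_sentencepiece_length=8,
    split_by_unicode_script=True,
    treat_whitespace_as_suffix=True,
    byte_fallback=True,
    num_threads=16,
)
```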

### Training Infrastructure
- **Platform:** macOS (Darwin 24.4.0)
- **Software:** SentencePiece 0.1.99

## Evaluation

### Testing Data
Six categories of Khmer text:
1. News articles
2. Buddhist/religious texts
3. Technical documentation
4. Literary/formal text
5. Colloquial/social media
6. Mixed numerals and dates

### Metrics

#### Compression Efficiency
- **Tokens Per Character (TPC):** 0.2206
- **Standard Deviation:** 0.0483
- **95% CI:** [0.1622, 0.3017]
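TPC is simply the number of tokens produced divided by the number of characters in the input. A sketch of how the mean and standard deviation across the six test categories could be recomputed (the per-category file names are placeholders, as the evaluation set is not bundled here):

```python
import statistics
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="khmer_v7.model")

# Placeholder per-category test files matching the six categories above.
test_files = ["news.txt", "religious.txt", "technical.txt",
              "literary.txt", "colloquial.txt", "numerals.txt"]

tpc_values = []
for path in test_files:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = sp.encode(text, out_type=str)
    tpc_values.append(len(tokens) / len(text))  # tokens per character

print(f"mean TPC = {statistics.mean(tpc_values):.4f}, stdev = {statistics.stdev(tpc_values):.4f}")
```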

#### Linguistic Coverage
- **Consonant Clusters:** 100% optimal
- **Sanskrit/Pali Loans:** 100% optimal
- **Vowel Combinations:** 25% optimal
- **Diacritics:** 50% optimal
- **Overall:** 61.8%

#### Special Features
- ✅ **Sanskrit/Pali Excellence:** ធម៌ → 1 token (was 5 tokens)
- ✅ **Morphological Awareness:** ការសិក្សា → ['ការ', 'សិក្សា']
- ✅ **Production Speed:** 14.6M chars/sec

## Environmental Impact
Minimal: training completed in minutes on standard hardware.

## Citation

```bibtex
@software{khmer_tokenizer_v7_2024,
  author = {Niko},
  title = {Khmer Tokenizer V7 - Advanced SentencePiece Model},
  year = {2024},
  version = {7.0},
  url = {https://github.com/yourusername/khmer-tokenizer-v7}
}
```

## Model Card Contact
For questions or feedback, please open an issue on GitHub.

---
*Generated based on PhD-level linguistic analysis and evaluation*
config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "khmer_tokenizer_v7",
  "tokenizer_type": "sentencepiece_unigram",
  "vocab_size": 16000,
  "language": "km",
  "version": "7.0",
  "metrics": {
    "phd_score": 78.0,
    "tpc": 0.2206,
    "morphological_accuracy": 0.5,
    "linguistic_coverage": 0.618,
    "sanskrit_pali_optimal": true
  },
  "training": {
    "corpus_size": "2.6M chars",
    "character_coverage": 0.9999,
    "max_piece_length": 8,
    "byte_fallback": true
  }
}
special_tokens_map.json
ADDED
@@ -0,0 +1,11 @@
{
  "unk_token": "<unk>",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "additional_special_tokens": [
    "<MASK>",
    "<CLS>",
    "<SEP>"
  ]
}
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2b0d40784b70c03f553de0f736513fd169c8fd825b40640117f502956b452a69
size 659464
tokenizer.vocab
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,14 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_type": "sentencepiece",
  "vocab_file": "khmer_v7.model",
  "special_tokens": {
    "unk_token": "<unk>",
    "bos_token": "<s>",
    "eos_token": "</s>",
    "pad_token": "<pad>",
    "mask_token": "<MASK>",
    "cls_token": "<CLS>",
    "sep_token": "<SEP>"
  }
}
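Note that this file appears to use a custom layout rather than the stock `transformers` tokenizer_config format, so it is safest to read it as documentation of the intended special tokens. A minimal sketch that loads the SentencePiece model together with `special_tokens_map.json` and checks how each special token resolves (assumes the files from this repo are in the working directory):

```python
import json
import sentencepiece as spm

# Assumes tokenizer.model and special_tokens_map.json from this repo are local.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
with open("special_tokens_map.json", encoding="utf-8") as f:
    special = json.load(f)

core = [special["unk_token"], special["bos_token"],
        special["eos_token"], special["pad_token"]]
for token in core + special["additional_special_tokens"]:
    piece_id = sp.piece_to_id(token)
    # piece_to_id falls back to the <unk> id for pieces missing from the vocabulary,
    # so this only shows whether each special token maps to a known piece.
    print(token, piece_id, sp.id_to_piece(piece_id))
```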