Advanced Thai Tokenizer V3
Overview
Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.
Performance
- Overall Accuracy: 24/24 test cases (100.0%); a reproduction sketch follows this list
- Vocabulary Size: 35,590 tokens
- Average Compression: 3.45 chars/token
- UNK Ratio: 0%
- Thai Character Coverage: 100%
- Tested on: Real-world, mixed, and edge-case sentences
- Training Corpus: combined_thai_corpus.txt (cleaned, deduplicated, multi-domain)
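As a quick sanity check, the roundtrip and compression figures can be reproduced along these lines. This is a minimal sketch; the sample sentences are illustrative placeholders, not the actual 24-case test set.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
# Placeholder sentences; the original 24 test cases are not bundled with the repo.
samples = ["นั่งตาก ลม", "ราคา 1,500 บาท", "ทดสอบ tokenizer ภาษาไทย"]
exact, total_chars, total_tokens = 0, 0, 0
for text in samples:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    exact += int(decoded == text)  # roundtrip: decoded text must match the input exactly
    total_chars += len(text)
    total_tokens += len(ids)
print(f"Roundtrip: {exact}/{len(samples)}")
print(f"Compression: {total_chars / total_tokens:.2f} chars/token")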
Key Features
- ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
- ✅ Handles mixed Thai-English, numbers, and symbols
- ✅ Modern vocabulary (internet, technology, social, business)
- ✅ Efficient compression (subword, not word-level)
- ✅ Clean decoding without artifacts
- ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
- ✅ Production-ready: tested, documented, and robust
Quick Start
from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")

    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")

    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
Files
- tokenizer.json — Main tokenizer file (HuggingFace format)
- vocab.json — Vocabulary mapping
- tokenizer_config.json — Transformers config
- metadata.json — Performance and configuration details
- usage_examples.json — Code examples
- README.md — This file
- combined_thai_corpus.txt — Training corpus (not included in repo, see dataset card)
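Because tokenizer.json is a standalone HuggingFace Tokenizers file, it can also be loaded without transformers. A minimal sketch, assuming a local clone of the repository:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # path assumes the repo files are available locally
print(tok.get_vocab_size())                  # expected: 35590
enc = tok.encode("นั่งตาก ลม", add_special_tokens=False)
print(enc.tokens)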
Created: July 2025
Model Card for Advanced Thai Tokenizer V3
Model Details
Developed by: ZombitX64 (https://huggingface.co/ZombitX64)
Model type: Unigram (subword) tokenizer
Language(s): th (Thai), mixed Thai-English
License: Apache-2.0
Finetuned from model: N/A (trained from scratch)
Model Sources
- Repository: https://huggingface.co/ZombitX64/Thaitokenizer
Uses
Direct Use
- Tokenization for Thai LLMs, NLP, and downstream tasks
- Preprocessing for text classification, NER, QA, summarization, etc.
- Robust for mixed Thai-English, numbers, and social content
Downstream Use
- Plug into HuggingFace Transformers pipelines (see the sketch after this list)
- Use as tokenizer for Thai LLM pretraining/fine-tuning
- Integrate with spaCy, PyThaiNLP, or custom pipelines
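A minimal sketch of the Transformers/datasets integration mentioned above; the dataset name and the "text" column are hypothetical placeholders for your own data:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
dataset = load_dataset("my_thai_dataset", split="train")  # hypothetical dataset name

def preprocess(batch):
    # Encode Thai text for a downstream model (padding/truncation values are examples only).
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(preprocess, batched=True)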
Out-of-Scope Use
- Not a language model (no text generation by itself)
- Not suitable for non-Thai-centric tasks
Bias, Risks, and Limitations
- Trained on public Thai web/corpus data; may reflect real-world bias
- Not guaranteed to cover rare dialects, slang, or OCR errors
- No explicit filtering for toxic/biased content in corpus
- Tokenizer does not understand context/meaning (no disambiguation)
Recommendations
- For best results, use with LLMs or models trained on a similar corpus
- For sensitive/critical applications, review corpus and test thoroughly
- For word-level tasks, use with context-aware models (NER, POS)
How to Get Started with the Model
from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")

    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")

    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
Training Details
Training Data
- Source: combined_thai_corpus.txt (cleaned, deduplicated, multi-domain Thai text)
- Size: 71.7M
- Preprocessing: duplicates removed, encoding normalized, minimal cleaning; no tokenizer-side normalization, no byte-level fallback
Training Procedure
- Tokenizer: HuggingFace Tokenizers (Unigram)
- Vocab size: 35,590
- Special tokens:
- Pre-tokenizer: Punctuation only
- No normalization, no post-processor, no decoder
- Training regime: CPU, Python 3.11, single run, see script for details
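The training script itself is not reproduced here; the following is a rough reconstruction of the settings listed above (Unigram model, punctuation-only pre-tokenizer, no normalizer, post-processor, or decoder). The special-token choice is an assumption, since the actual list is not documented.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()  # punctuation-only pre-tokenization
# No normalizer, post-processor, or decoder is attached, matching the settings above.
trainer = trainers.UnigramTrainer(
    vocab_size=35590,
    special_tokens=["<unk>"],  # assumption: the actual special-token list is not documented
    unk_token="<unk>",
)
tokenizer.train(["combined_thai_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")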
Speeds, Sizes, Times
- Training time: -
- Checkpoint size: tokenizer.json ~[size] KB
Evaluation
Testing Data, Factors & Metrics
- Testing data: Real-world Thai sentences, mixed content, edge cases
- Metrics: Roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
- Results: 100% roundtrip, 0% UNK, 100% Thai char coverage, 3.45 chars/token
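A minimal sketch of how the UNK ratio and Thai character coverage can be checked; the sentences are illustrative, not the original evaluation set:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
samples = ["โปรโมชั่นมือถือรุ่นใหม่ ลด 50%", "ประชุม Zoom เวลา 14.00 น."]  # placeholder sentences
unk_id = tokenizer.unk_token_id
total, unks, missing = 0, 0, 0
for text in samples:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    total += len(ids)
    unks += sum(1 for i in ids if i == unk_id)
    # Count Thai characters that do not survive the encode/decode roundtrip.
    missing += sum(1 for ch in text if "\u0e00" <= ch <= "\u0e7f" and ch not in decoded)
print(f"UNK ratio: {unks / total:.2%}")
print(f"Thai characters lost: {missing}")  # 0 corresponds to 100% coverage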
Environmental Impact
- Trained on CPU with low energy usage
- No large-scale GPU/TPU compute required
Technical Specifications
- Model architecture: Unigram (subword) tokenizer
- Software: tokenizers >= 0.15, Python 3.11
- Hardware: Standard CPU (no GPU required)
Citation
If you use this tokenizer, please cite:
@misc{zombitx64_thaitokenizer_v3_2025,
  author = {ZombitX64},
  title = {Advanced Thai Tokenizer V3},
  year = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}
Model Card Authors
ZombitX64
Model Card Contact
For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.