# ZombitX64 Thai Tokenizer
A simple Thai language tokenizer that properly handles newlines and Thai text segmentation.
## Features
- Newline Preservation: Correctly handles and preserves newlines in tokenized text
- Thai Character Support: Recognizes and processes Thai Unicode characters
- Hugging Face Compatible: Works with transformers library
- Simple API: Easy to use tokenize and detokenize methods
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
```
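Continuing from the snippet above, a quick round trip can confirm the newline-preservation behaviour listed under Features (the expected output assumes the tokenizer does preserve `\n`):

```python
# Round-trip check: the "\n" between the two Thai sentences
# should still be present after encoding and decoding.
decoded = tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)
print("\n" in decoded)  # expected: True if newlines are preserved
```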
## Model Details
- Model Type: Thai Tokenizer
- Language: Thai (th)
- Vocab Size: 112
- Max Length: 512
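Because the maximum length is 512 tokens, longer inputs should be truncated explicitly when encoding. A minimal sketch, assuming the tokenizer follows the standard `transformers` calling convention and reusing the `tokenizer` loaded in the Usage section:

```python
# Truncate inputs that exceed the tokenizer's 512-token limit.
long_text = "สวัสดีครับ " * 300  # deliberately longer than 512 tokens
encoded = tokenizer(long_text, max_length=512, truncation=True)
print(len(encoded["input_ids"]))  # at most 512
```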
## Training Data
This tokenizer was trained on basic Thai character sets and common patterns.
## Limitations
- Basic Thai word segmentation (can be improved with pythainlp)
- Simple vocabulary (expandable for specific use cases)
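As noted above, word segmentation can be improved by pre-segmenting text with pythainlp before passing it to this tokenizer. A sketch of one possible approach, assuming pythainlp is installed and that joining segments with spaces is acceptable for your downstream use:

```python
from pythainlp.tokenize import word_tokenize

text = "นี่คือตัวอย่างประโยคภาษาไทย"

# Segment into Thai words with pythainlp's default engine (newmm),
# then join with spaces so word boundaries are explicit.
words = word_tokenize(text)
pre_segmented = " ".join(words)

tokens = tokenizer.tokenize(pre_segmented)
print(words)
print(tokens)
```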
## Contact
For questions or issues, please visit the GitHub repository.