ZombitX64 Thai Tokenizer

A simple Thai language tokenizer that properly handles newlines and Thai text segmentation.

Features

  • Newline Preservation: Correctly handles and preserves newlines in tokenized text
  • Thai Character Support: Recognizes and processes Thai Unicode characters
  • Hugging Face Compatible: Works with the transformers library
  • Simple API: Easy-to-use tokenize and detokenize methods

Usage

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
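
The newline handling can be checked with a quick round trip. A minimal sketch, assuming the tokenizer is loaded as above; the example string is illustrative:

# Verify that newlines survive an encode/decode round trip
text_with_newlines = "บรรทัดแรก\nบรรทัดที่สอง"
ids = tokenizer.encode(text_with_newlines)
roundtrip = tokenizer.decode(ids, skip_special_tokens=True)
print(roundtrip == text_with_newlines)  # expected True if newlines are preserved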

Model Details

  • Model Type: Thai Tokenizer
  • Language: Thai (th)
  • Vocab Size: 112
  • Max Length: 512
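
These values can be read off the loaded tokenizer and used to truncate long inputs. A minimal sketch, assuming the standard transformers attributes are populated for this tokenizer:

# Inspect the reported vocabulary size and maximum length
print(tokenizer.vocab_size)        # expected: 112
print(tokenizer.model_max_length)  # expected: 512

# Truncate inputs that exceed the 512-token limit
encoded = tokenizer(text, truncation=True, max_length=512)
print(encoded["input_ids"])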

Training Data

This tokenizer was trained on basic Thai character sets and common patterns.

Limitations

  • Basic Thai word segmentation (can be improved with pythainlp; see the sketch after this list)
  • Simple vocabulary (expandable for specific use cases)
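
For stronger word segmentation, pythainlp can be used as a pre-segmentation step before encoding. A minimal sketch, assuming pythainlp is installed (pip install pythainlp) and the tokenizer is loaded as in Usage:

from pythainlp.tokenize import word_tokenize

# Word-level segmentation with pythainlp's default "newmm" engine
words = word_tokenize("สวัสดีครับนี่คือตัวอย่าง", engine="newmm")
print(words)

# The pre-segmented words can then be joined and encoded with this tokenizer
ids = tokenizer.encode(" ".join(words))
print(ids)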

Contact

For questions or issues, please visit the GitHub repository.
