ZombitX64 Thai Tokenizer

A simple Thai language tokenizer that properly handles newlines and Thai text segmentation.

Features

  • Newline Preservation: Correctly handles and preserves newlines in tokenized text
  • Thai Character Support: Recognizes and processes Thai Unicode characters
  • Hugging Face Compatible: Works with the transformers library
  • Simple API: Easy-to-use tokenize and detokenize methods

Usage

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
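
The newline handling can be checked with a quick round trip. A minimal sketch, assuming the tokenizer is loaded as above; the example string is illustrative:

# Verify that newlines survive an encode/decode round trip
text_with_newlines = "บรรทัดแรก\nบรรทัดที่สอง"
ids = tokenizer.encode(text_with_newlines)
roundtrip = tokenizer.decode(ids, skip_special_tokens=True)
print(roundtrip == text_with_newlines)  # expected True if newlines are preserved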

Model Details

  • Model Type: Thai Tokenizer
  • Language: Thai (th)
  • Vocab Size: 112
  • Max Length: 512
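
These values can be read off the loaded tokenizer and used to truncate long inputs. A minimal sketch, assuming the standard transformers attributes are populated for this tokenizer:

# Inspect the reported vocabulary size and maximum length
print(tokenizer.vocab_size)        # expected: 112
print(tokenizer.model_max_length)  # expected: 512

# Truncate inputs that exceed the 512-token limit
encoded = tokenizer(text, truncation=True, max_length=512)
print(encoded["input_ids"])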

Training Data

This tokenizer was trained on basic Thai character sets and common patterns.

Limitations

  • Basic Thai word segmentation (can be improved with pythainlp; see the sketch after this list)
  • Simple vocabulary (expandable for specific use cases)
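
For stronger word segmentation, pythainlp can be used as a pre-segmentation step before encoding. A minimal sketch, assuming pythainlp is installed (pip install pythainlp) and the tokenizer is loaded as in Usage:

from pythainlp.tokenize import word_tokenize

# Word-level segmentation with pythainlp's default "newmm" engine
words = word_tokenize("สวัสดีครับนี่คือตัวอย่าง", engine="newmm")
print(words)

# The pre-segmented words can then be joined and encoded with this tokenizer
ids = tokenizer.encode(" ".join(words))
print(ids)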

Contact

For questions or issues, please visit the GitHub repository.
