XLM-RoBERTa Khmer Custom Tokenizer

This is a custom tokenizer trained for the Khmer language, based on the XLM-RoBERTa architecture.

Model Details

  • Model Type: XLM-RoBERTa Tokenizer
  • Language: Khmer (km)
  • Training Data: Custom Khmer dataset with ~84M examples
  • Tokenizer Type: SentencePiece-based
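
The SentencePiece approach segments raw text into subword pieces from a learned vocabulary. As a rough illustration of the idea (not the released tokenizer's actual vocabulary or algorithm, which uses a learned model rather than greedy matching), a toy greedy longest-match segmenter looks like this:

```python
# Toy illustration of SentencePiece-style subword segmentation.
# TOY_VOCAB is made up for demonstration; the real tokenizer's
# vocabulary is learned from the Khmer training corpus.
TOY_VOCAB = {"▁learn", "ing", "▁", "l", "e", "a", "r", "n", "i", "g"}

def greedy_segment(text, vocab):
    """Greedily match the longest vocabulary piece at each position.

    Real SentencePiece picks a globally optimal segmentation; greedy
    longest-match is a simplified stand-in that shows the subword idea.
    """
    text = "▁" + text.replace(" ", "▁")  # SentencePiece word-boundary marker
    pieces = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])  # fall back to a single character
            i += 1
    return pieces

print(greedy_segment("learning", TOY_VOCAB))  # → ['▁learn', 'ing']
```

Out-of-vocabulary words decompose into smaller known pieces, which is why a subword tokenizer can cover open-vocabulary Khmer text with a fixed vocabulary size.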

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metythorn/xlm-roberta-tokenizer-100k")

# Example usage
text = "ខ្ញុំចង់រៀនភាសាខ្មែរ"  # "I want to learn the Khmer language"
tokens = tokenizer.tokenize(text)
print(tokens)  # list of subword pieces

Training Details

  • Base model: XLM-RoBERTa
  • Training examples: ~84 million
  • Epochs: 3
  • Batch size: 8
  • Parameters: 278,295,186

Intended Use

This tokenizer is designed for:

  • Khmer text processing
  • Masked language modeling
  • Fine-tuning on Khmer NLP tasks
  • Research in Khmer language understanding
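
For masked language modeling, token sequences are corrupted before training. A minimal sketch of that masking step, assuming XLM-RoBERTa's conventional `<mask>` token and 15% masking rate (this helper is illustrative, not part of the released tokenizer):

```python
import random

MASK_TOKEN = "<mask>"  # XLM-RoBERTa's mask token

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with the mask token.

    Returns the corrupted sequence and the positions that were masked,
    so the training objective can score predictions only at those spots.
    """
    rng = random.Random(seed)
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions
```

In practice the masked sequence is fed to the model and the loss is computed only at the recorded positions; production pipelines also substitute random tokens for a fraction of masked slots, which this sketch omits.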

Limitations

  • Primarily trained on Khmer text
  • May not handle code-switching effectively
  • Performance may vary between formal and informal Khmer