metadata
language:
  - km
license: apache-2.0
tags:
  - xlm-roberta
  - khmer
  - tokenizer
  - masked-lm
datasets:
  - custom-khmer-dataset
widget:
  - text: ខ្ញុំចង់<mask>ភាសាខ្មែរ

XLM-RoBERTa Khmer Custom Tokenizer

This is a custom tokenizer trained for the Khmer language, based on the XLM-RoBERTa architecture.

Model Details

  • Model Type: XLM-RoBERTa Tokenizer
  • Language: Khmer (km)
  • Training Data: Custom Khmer dataset with ~84M examples
  • Tokenizer Type: SentencePiece-based

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metythorn/xlm-roberta-tokenizer-100k")

# Example usage
text = "ខ្ញុំចង់រៀនភាសាខ្មែរ"
tokens = tokenizer.tokenize(text)
print(tokens)
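
Beyond `tokenize`, it is often useful to see the vocabulary IDs a model would receive and to check what decoding gives back. The following is a minimal sketch: the helper names `load_tokenizer` and `inspect_tokenization` are illustrative and not part of this repository, and the code assumes the loaded tokenizer exposes the standard `transformers` methods `tokenize`, `convert_tokens_to_ids`, and `decode`.

```python
def load_tokenizer(repo_id="metythorn/xlm-roberta-tokenizer-100k"):
    """Download the tokenizer from the Hugging Face Hub (needs network access)."""
    # Imported lazily so inspect_tokenization can be used on any tokenizer-like object.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def inspect_tokenization(tokenizer, text):
    """Return the subword pieces, their vocabulary IDs, and the decoded text."""
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # No special tokens are added above, so this flag is only a safeguard.
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {"tokens": tokens, "ids": ids, "decoded": decoded}
```

Calling `inspect_tokenization(load_tokenizer(), text)` returns the three views side by side. A mismatch between `decoded` and the input usually points to text normalization or to pieces mapped to the unknown token.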

Training Details

  • Base model: XLM-RoBERTa
  • Training examples: ~84 million
  • Epochs: 3
  • Batch size: 8
  • Parameters: 278,295,186
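
The figures above imply a rough number of optimizer steps. The following back-of-the-envelope sketch assumes that the batch size of 8 is the effective global batch size (a single device, no gradient accumulation), that 84 million is treated as an exact count, and that a final partial batch is kept. None of these assumptions is stated in this card.

```python
import math

# Values taken from the list above; the example count is approximate in the card.
TRAINING_EXAMPLES = 84_000_000
BATCH_SIZE = 8  # assumed to be the effective (global) batch size
EPOCHS = 3


def steps_per_epoch(num_examples, batch_size):
    """Batches needed to see every example once (last partial batch kept)."""
    return math.ceil(num_examples / batch_size)


def total_steps(num_examples, batch_size, epochs):
    """Total optimizer steps over all epochs, with no gradient accumulation."""
    return steps_per_epoch(num_examples, batch_size) * epochs


print(steps_per_epoch(TRAINING_EXAMPLES, BATCH_SIZE))      # 10500000
print(total_steps(TRAINING_EXAMPLES, BATCH_SIZE, EPOCHS))  # 31500000
```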

Intended Use

This tokenizer is designed for:

  • Khmer text processing
  • Masked language modeling
  • Fine-tuning on Khmer NLP tasks
  • Research in Khmer language understanding
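
For masked language modeling, the input carries the mask token in place of the word to predict. A minimal sketch, assuming the mask token is the literal string `<mask>` (as in the widget example in the metadata); `mask_first` is a hypothetical helper and not part of the tokenizer:

```python
MASK_TOKEN = "<mask>"  # assumed; with a loaded tokenizer, prefer tokenizer.mask_token


def mask_first(text, word, mask_token=MASK_TOKEN):
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError("word to mask does not occur in text")
    return text.replace(word, mask_token, 1)


sentence = "ខ្ញុំចង់រៀនភាសាខ្មែរ"
print(mask_first(sentence, "រៀន"))
```

Matching on a substring rather than on character offsets avoids having to count Khmer combining marks by hand.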

Limitations

  • Primarily trained on Khmer text
  • May not handle code-switching (for example, Khmer mixed with English) effectively
  • Performance may differ between formal and informal Khmer
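
Because the tokenizer targets Khmer, it can help to flag mixed-script inputs before tokenizing them. The sketch below measures the share of non-whitespace characters that fall in the Khmer Unicode block (U+1780 to U+17FF). The function names and the 0.8 threshold are illustrative assumptions, not values from this card; Latin digits and punctuation count as non-Khmer here.

```python
def khmer_ratio(text):
    """Fraction of non-whitespace characters in the Khmer block U+1780..U+17FF."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    khmer = sum(1 for c in chars if "\u1780" <= c <= "\u17ff")
    return khmer / len(chars)


def looks_code_switched(text, threshold=0.8):
    """Heuristic: True when less than `threshold` of the characters are Khmer."""
    return khmer_ratio(text) < threshold
```

A check like this is only a coarse filter: it detects script mixing, not romanized Khmer or differences in register.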