---
language:
- km
license: apache-2.0
tags:
- xlm-roberta
- khmer
- tokenizer
- masked-lm
datasets:
- custom-khmer-dataset
widget:
- text: ខ្ញុំចង់<mask>ភាសាខ្មែរ
---
# XLM-RoBERTa Khmer Custom Tokenizer

This is a custom tokenizer trained for the Khmer language, based on the XLM-RoBERTa architecture.
## Model Details
- Model Type: XLM-RoBERTa Tokenizer
- Language: Khmer (km)
- Training Data: Custom Khmer dataset with ~84 million examples
- Tokenizer Type: SentencePiece-based
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metythorn/xlm-roberta-tokenizer-100k")

# Example usage
text = "ខ្ញុំចង់រៀនភាសាខ្មែរ"
tokens = tokenizer.tokenize(text)
print(tokens)
```
## Training Details
- Base model: XLM-RoBERTa
- Training examples: ~84 million
- Epochs: 3
- Batch size: 8
- Parameters: 278,295,186
## Intended Use
This tokenizer is designed for:
- Khmer text processing
- Masked language modeling
- Fine-tuning on Khmer NLP tasks
- Research in Khmer language understanding
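Since masked language modeling is a primary intended use, here is a minimal, tokenizer-agnostic sketch of the masking step used to build MLM training inputs. The token ids and `mask_token_id` below are illustrative placeholders, not actual outputs of this tokenizer:

```python
import random

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15, seed=0):
    """Randomly replace ~mask_prob of positions with the mask token id.

    Returns (masked_ids, labels), where labels hold the original id at
    masked positions and -100 elsewhere (-100 is ignored by the loss,
    following the Hugging Face convention).
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i in range(len(token_ids)):
        if rng.random() < mask_prob:
            labels[i] = token_ids[i]   # remember the original token
            masked[i] = mask_token_id  # hide it from the model
    return masked, labels

# Illustrative ids only; real ids come from tokenizer.encode(...)
masked, labels = mask_tokens([5, 6, 7, 8, 9], mask_token_id=4, mask_prob=0.3)
print(masked, labels)
```

In practice, `transformers`' `DataCollatorForLanguageModeling` handles this step (including the additional 80/10/10 mask/random/keep scheme); the sketch above shows only the core idea.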
## Limitations
- Primarily trained on Khmer text
- May not handle code-switching (mixed Khmer and other languages) effectively
- Performance may vary between formal and informal registers of Khmer