# XLM-RoBERTa Khmer Custom Tokenizer

This is a custom tokenizer trained for the Khmer language, based on the XLM-RoBERTa architecture.
## Model Details
- Model Type: XLM-RoBERTa Tokenizer
- Language: Khmer (km)
- Training Data: Custom Khmer dataset with ~84M examples
- Tokenizer Type: SentencePiece-based
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metythorn/xlm-roberta-tokenizer-100k")

# Example usage
text = "ខ្ញុំចង់រៀនភាសាខ្មែរ"
tokens = tokenizer.tokenize(text)
print(tokens)
```
## Training Details

- Base model: XLM-RoBERTa
- Training examples: ~84 million
- Epochs: 3
- Batch size: 8
- Parameters: 278,295,186
## Intended Use
This tokenizer is designed for:
- Khmer text processing
- Masked language modeling
- Fine-tuning on Khmer NLP tasks
- Research in Khmer language understanding
## Limitations
- Primarily trained on Khmer text
- May not handle code-switching effectively
- Performance may vary between formal and informal Khmer