---
language:
- km
license: apache-2.0
tags:
- xlm-roberta
- khmer
- tokenizer
- masked-lm
datasets:
- custom-khmer-dataset
widget:
- text: "ខ្ញុំចង់ភាសាខ្មែរ"
---

# XLM-RoBERTa Khmer Custom Tokenizer

This is a custom tokenizer for the Khmer language, based on the XLM-RoBERTa architecture.

## Model Details

- **Model Type**: XLM-RoBERTa Tokenizer
- **Language**: Khmer (km)
- **Training Data**: Custom Khmer dataset with ~84 million examples
- **Tokenizer Type**: SentencePiece-based

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metythorn/xlm-roberta-tokenizer-100k")

# Example usage
text = "ខ្ញុំចង់រៀនភាសាខ្មែរ"  # "I want to learn the Khmer language"
tokens = tokenizer.tokenize(text)
print(tokens)
```

## Training Details

- Base model: XLM-RoBERTa
- Training examples: ~84 million
- Epochs: 3
- Batch size: 8
- Parameters: 278,295,186

## Intended Use

This tokenizer is designed for:

- Khmer text processing
- Masked language modeling
- Fine-tuning on Khmer NLP tasks
- Research in Khmer language understanding

## Limitations

- Primarily trained on Khmer text
- May not handle code-switching effectively
- Performance on formal vs. informal Khmer may vary
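
## Encoding and Decoding

The sketch below extends the usage example above with the standard `transformers` encode/decode round trip, showing the `input_ids` and `attention_mask` tensors that a downstream XLM-RoBERTa model would consume.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metythorn/xlm-roberta-tokenizer-100k")

text = "ខ្ញុំចង់រៀនភាសាខ្មែរ"  # "I want to learn the Khmer language"

# Encode to model-ready tensors (special tokens <s> ... </s> are added automatically)
encoded = tokenizer(text, return_tensors="pt")
print(encoded["input_ids"])
print(encoded["attention_mask"])

# Decode back to text, dropping the special tokens
decoded = tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True)
print(decoded)
```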
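
## Masked Language Modeling Example

The card lists masked language modeling as an intended use, so here is a minimal fill-mask sketch using this tokenizer. Note that `path/to/khmer-xlm-roberta-mlm` is a placeholder, not a published checkpoint: the card does not name a paired MLM model, and the predictions are only meaningful with a model trained on this tokenizer's vocabulary.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metythorn/xlm-roberta-tokenizer-100k")

# Placeholder path: substitute an XLM-RoBERTa MLM checkpoint trained with this tokenizer
model = AutoModelForMaskedLM.from_pretrained("path/to/khmer-xlm-roberta-mlm")

# "I want to <mask> the Khmer language"
text = f"ខ្ញុំចង់{tokenizer.mask_token}ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 predictions at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```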