---
language:
- km
license: apache-2.0
tags:
- xlm-roberta
- khmer
- masked-lm
- fill-mask
- pytorch
- transformers
widget:
- text: "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
- text: "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត"
- text: "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា"
metrics:
- perplexity
base_model: xlm-roberta-base
pipeline_tag: fill-mask
---

# XLM-RoBERTa Khmer Masked Language Model

This is a pretrained language model based on the XLM-RoBERTa architecture for Khmer and English, trained for masked language modeling (MLM). In informal evaluations, it performs better than the original FacebookAI/xlm-roberta-base on Khmer MLM tasks.

## Model Details

- **Model Type**: XLM-RoBERTa for Masked Language Modeling
- **Language**: Khmer (km)
- **Base Model**: xlm-roberta-base
- **Training Data**: Khmer and English dataset with 84M examples
- **Parameters**: 93,733,648 trainable parameters
- **Training Steps**: 1,122,978
- **Final Checkpoint**: Step 358,500

## Training Details

- **Training Examples**: 84 million examples, approximately 8.2 GB
- **Epochs**: 3
- **Batch Size**: 8 (with DataParallel)
- **Gradient Accumulation**: 1
- **Total Optimization Steps**: 1,122,978
- **Learning Rate**: ~2e-5 (with scheduler)
- **Hardware and Training Time**: 4 GPUs, approximately 2 days of training

## Training Metrics

- **Final Training Loss**: 1.5163
- **Final Learning Rate**: 6.61e-06
- **Final Gradient Norm**: 2.9005
- **Training Epoch**: 66.94

## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

# Load the model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Example usage -- the input must contain the tokenizer's mask token
result = fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ")
print(result)
```

### Direct Model Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Example usage -- the input must contain the mask token
text = "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the masked token
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits

# Decode the highest-scoring token at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = predictions[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

## Intended Use

This model is designed for:
- **Fill-mask tasks** in the Khmer language
- **Feature extraction** for Khmer text
- **Fine-tuning** on downstream Khmer NLP tasks
- **Research** in Khmer language understanding

## Limitations

- Primarily trained on Khmer text patterns
- May not handle code-switching effectively
- Performance may vary between formal and informal Khmer
- Limited exposure to technical or domain-specific vocabulary

## Training Data

The model was trained on a custom Khmer dataset containing various text sources to ensure broad language coverage.

## Evaluation

Use this model for masked language modeling evaluation:

```python
from transformers import pipeline

# Load model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Test examples -- each must contain a mask token
test_sentences = [
    "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត",
    "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា",
    "ខ្ញុំចង់<mask>សៀវភៅ",
]

for sentence in test_sentences:
    result = fill_mask(sentence)
    print(f"Input: {sentence}")
    print(f"Top prediction: {result[0]['token_str']}")
    print("---")
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{xlm-roberta-khmer,
  title={XLM-RoBERTa Khmer Masked Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/metythorn/khmer-xlm-roberta-base}
}
```
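The card lists perplexity as its evaluation metric but never shows how it relates to the reported loss. As a quick sketch (assuming the final training loss of 1.5163 is a mean cross-entropy in nats, the convention in Hugging Face training loops), training perplexity is simply the exponential of that loss:

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
# 1.5163 is the final training loss reported on this card.
final_loss = 1.5163
perplexity = math.exp(final_loss)
print(round(perplexity, 2))  # roughly 4.56
```

By this reading the model's training perplexity is roughly 4.6; a held-out perplexity would require running the same computation over the loss on an evaluation set.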