XLM-RoBERTa Khmer Masked Language Model
This is a pretrained language model based on the XLM-RoBERTa architecture for Khmer and English, trained for masked language modeling.
Model Details
- Model Type: XLM-RoBERTa for Masked Language Modeling
- Languages: Khmer (km) and English (en)
- Base Model: xlm-roberta-base
- Training Data: Khmer and English dataset of 31M examples, roughly 6 billion characters in total
- Parameters: 163M trainable parameters (see the verification snippet after this list)
- Training Steps: 1,122,978
- Final Checkpoint: Step 1,950,500
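The parameter count can be checked directly against the published checkpoint. A minimal sketch, assuming the standard transformers API:

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Sum over trainable parameters; the card reports 163M
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{trainable / 1e6:.0f}M trainable parameters")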
Training Details
- Training Examples: 31 million examples (approximately 8 GB)
- Epochs: 100
- Batch Size: 16 (with DataParallel)
- Gradient Accumulation: 1
- Total Optimization Steps: 14,509,200
- Learning Rate: ~2e-5 (with scheduler)
- Hardware: single server with 4 GPUs
- Training Time: approximately 10 days
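The original training script is not included in this card. The sketch below shows how the hyperparameters listed above would map onto the Hugging Face Trainer API; the 15% masking probability, the checkpoint frequency, and the tiny placeholder corpus are assumptions, not details from the original setup.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Tiny placeholder corpus standing in for the 31M-example dataset (assumption)
texts = ["ខ្ញុំចង់រៀនភាសាខ្មែរ"] * 64
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard MLM collator; mlm_probability=0.15 is the RoBERTa default (assumption)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="khmer-xlm-roberta-base",
    num_train_epochs=100,
    per_device_train_batch_size=16,  # the card lists batch size 16 with DataParallel
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    save_steps=50_000,  # checkpoint frequency is not documented (assumption)
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()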
Training Metrics
- Final Training Loss: 2.3641
- Final Learning Rate: 1.73e-05
- Final Gradient Norm: 5.9456
- Final Logged Epoch: 13.44
Usage
Fill-Mask Pipeline
from transformers import pipeline
# Load the model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")
# Example usage: "I want to <mask> the Khmer language"
result = fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ")
print(result)
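Each entry in the returned list is a dict with score, token, token_str, and sequence keys. To see more than the default number of candidates, recent transformers versions accept a top_k argument (whether your installed version supports it is an assumption here):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base", top_k=5)
for candidate in fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ"):
    print(candidate["token_str"], round(candidate["score"], 4))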
Direct Model Usage
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Example usage: "I want to <mask> the Khmer language"
text = "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the masked token
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits

# Decode the top prediction at the mask position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(predictions[0, mask_index].argmax(dim=-1)))
Intended Use
This model is designed for:
- Fill-mask tasks in Khmer
- Feature extraction for Khmer text (see the sketch after this list)
- Fine-tuning on downstream Khmer NLP tasks
- Research in Khmer language understanding
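For the feature-extraction use case, a minimal sketch (assuming AutoModel exposes the encoder and that mean pooling is an acceptable sentence representation):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModel.from_pretrained("metythorn/khmer-xlm-roberta-base")

inputs = tokenizer("ភាសាខ្មែរ", return_tensors="pt")  # "Khmer language"
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one sentence vector
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)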
Limitations
- Primarily trained on Khmer text patterns
- May not handle code-switching effectively
- Performance may vary between formal and informal Khmer
- Limited exposure to technical or domain-specific vocabulary
Training Data
The model was trained on a custom Khmer and English dataset drawn from varied text sources to ensure broad language coverage.
Evaluation
Use this model for masked language modeling evaluation:
from transformers import pipeline

# Load model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Test examples
test_sentences = [
    "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត",  # "Cambodia has <mask> provinces"
    "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា",  # "The capital Phnom Penh is <mask> of Cambodia"
    "ខ្ញុំចង់<mask>សៀវភៅ",  # "I want to <mask> a book"
]

for sentence in test_sentences:
    result = fill_mask(sentence)
    print(f"Input: {sentence}")
    print(f"Top prediction: {result[0]['token_str']}")
    print("---")
Citation
If you use this model in your research, please cite:
@misc{xlm-roberta-khmer,
  title={XLM-RoBERTa Khmer Masked Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/metythorn/khmer-xlm-roberta-base}
}