XLM-RoBERTa Khmer Masked Language Model

This is a pretrained language model using the XLM-RoBERTa architecture for Khmer and English, trained for masked language modeling. In informal evaluations, this model performs better than the original FacebookAI/xlm-roberta-base on Khmer MLM tasks.

Model Details

  • Model Type: XLM-RoBERTa for Masked Language Modeling
  • Language: Khmer (km)
  • Base Model: xlm-roberta-base
  • Training Data: Khmer & English dataset with 84M examples
  • Parameters: 93,733,648 trainable parameters
  • Training Steps: 1,122,978
  • Final Checkpoint: Step 358500
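
As a quick sanity check, the reported parameter count can be reproduced by loading the checkpoint and summing trainable parameters. A minimal sketch, assuming transformers and torch are installed:

from transformers import AutoModelForMaskedLM

# Load the checkpoint and count trainable parameters
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")  # 93,733,648 per the card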

Training Details

  • Training Examples: 84 million examples (approximately 8.2 GB)
  • Epochs: 3
  • Batch Size: 8 (with DataParallel)
  • Gradient Accumulation: 1
  • Total Optimization Steps: 1,122,978
  • Learning Rate: ~2e-5 (with scheduler)
  • Hardware and Training Time: 4 GPUs, approximately 2 days of training
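
For reference, a comparable MLM run can be assembled with the Hugging Face Trainer. This is a sketch under the hyperparameters listed above, not the actual training script; the corpus file and tokenization settings are assumptions:

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Hypothetical corpus file; the card's Khmer & English dataset is not published
dataset = load_dataset("text", data_files={"train": "khmer_english_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard MLM collator: randomly masks 15% of tokens
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="khmer-xlm-roberta-base",
    per_device_train_batch_size=8,  # matches the card
    gradient_accumulation_steps=1,
    learning_rate=2e-5,             # decayed by the default linear scheduler
    num_train_epochs=3,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()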

Training Metrics

  • Final Training Loss: 1.5163
  • Final Learning Rate: 6.61e-06
  • Final Gradient Norm: 2.9005
  • Training Epoch: 66.94
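
For context, the final training loss corresponds to a perplexity over masked tokens of roughly exp(1.5163) ≈ 4.56, assuming it is the standard per-token cross-entropy:

import math

# Cross-entropy loss -> perplexity on masked positions
print(math.exp(1.5163))  # ≈ 4.56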

Usage

Fill-Mask Pipeline

from transformers import pipeline

# Load the model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Example usage
result = fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ")
print(result)
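
The pipeline returns a list of candidate fills ranked by score; continuing from the block above, top_k controls how many candidates are returned:

# Inspect the top 5 candidates and their scores
for pred in fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ", top_k=5):
    print(pred["token_str"], round(pred["score"], 4))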

Direct Model Usage

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Example usage
text = "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the masked token
with torch.no_grad():
    outputs = model(**inputs)

# Locate the <mask> position and decode the highest-scoring token
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_token_id = outputs.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(top_token_id))

Intended Use

This model is designed for:

  • Fill-mask tasks in Khmer language
  • Feature extraction for Khmer text (see the sketch after this list)
  • Fine-tuning on downstream Khmer NLP tasks
  • Research in Khmer language understanding
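
For feature extraction, the encoder can be loaded via AutoModel and its hidden states pooled into sentence vectors. A minimal sketch; the mean-pooling strategy here is one common choice, not something prescribed by the card:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
encoder = AutoModel.from_pretrained("metythorn/khmer-xlm-roberta-base")

inputs = tokenizer("ខ្ញុំចង់រៀនភាសាខ្មែរ", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)

# Mean-pool token embeddings into a single sentence vector
sentence_vector = hidden.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])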

Limitations

  • Primarily trained on Khmer text patterns
  • May not handle code-switching effectively
  • Performance may vary between formal and informal Khmer
  • Limited exposure to technical or domain-specific vocabulary

Training Data

The model was trained on a custom Khmer and English dataset drawing on a variety of text sources to ensure broad language coverage.

Evaluation

To evaluate the model on masked language modeling, run the fill-mask pipeline over a set of test sentences:

from transformers import pipeline

# Load model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Test examples
test_sentences = [
    "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត",
    "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា",
    "ខ្ញុំចង់<mask>សៀវភៅ"
]

for sentence in test_sentences:
    result = fill_mask(sentence)
    print(f"Input: {sentence}")
    print(f"Top prediction: {result[0]['token_str']}")
    print("---")

Citation

If you use this model in your research, please cite:

@misc{xlm-roberta-khmer,
  title={XLM-RoBERTa Khmer Masked Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/metythorn/khmer-xlm-roberta-base}
}