XLM-RoBERTa Khmer Masked Language Model

This is a pretrained language model based on the XLM-RoBERTa architecture for Khmer and English, trained for masked language modeling.

Model Details

  • Model Type: XLM-RoBERTa for Masked Language Modeling
  • Languages: Khmer (km) and English (en)
  • Base Model: xlm-roberta-base
  • Training Data: Khmer and English dataset of 31M examples (~6 billion characters in total)
  • Parameters: 163M trainable parameters
  • Training Steps: 1,122,978
  • Final Checkpoint: Step 1,950,500

Training Details

  • Training Examples: 31 million examples (approximately 8 GB of text)
  • Epochs: 100
  • Batch Size: 16 (with DataParallel)
  • Gradient Accumulation: 1
  • Total Optimization Steps: 14,509,200
  • Learning Rate: ~2e-5 (with scheduler)
  • Hardware: Single server with 4 GPUs
  • Training Time: approximately 10 days
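
The training script itself is not published. As a rough illustration only, a masked-language-modeling run with the hyperparameters above could be set up with the Hugging Face Trainer along these lines (the corpus file, sequence length, masking rate, and save interval are placeholders, not the actual configuration):

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# "khmer_corpus.txt" is a placeholder; the actual training corpus is not released
dataset = load_dataset("text", data_files={"train": "khmer_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard RoBERTa-style objective: randomly mask 15% of tokens
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="khmer-xlm-roberta-base",
    per_device_train_batch_size=16,   # batch size listed above
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    num_train_epochs=100,
    save_steps=10_000,                # placeholder save interval
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()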

Training Metrics

  • Final Training Loss: 2.3641
  • Final Learning Rate: 1.73e-05
  • Final Gradient Norm: 5.9456
  • Epoch at Final Checkpoint: 13.44

Usage

Fill-Mask Pipeline

from transformers import pipeline

# Load the model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Example usage: "I want to <mask> the Khmer language"
result = fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ")
print(result)
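
Each entry in result is a dict with score, token, token_str, and sequence keys; pass top_k to the call (e.g. fill_mask(text, top_k=5)) to control how many candidates are returned.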

Direct Model Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Example usage: "I want to <mask> the Khmer language"
text = "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the masked token and decode the top candidate
with torch.no_grad():
    outputs = model(**inputs)
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_token_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(top_token_id))

Intended Use

This model is designed for:

  • Fill-mask tasks in Khmer language
  • Feature extraction for Khmer text (see the sketch after this list)
  • Fine-tuning on downstream Khmer NLP tasks
  • Research in Khmer language understanding
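
As an illustration of the feature-extraction use case, a minimal sketch (mean pooling is a common default here, not something this card prescribes):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
encoder = AutoModel.from_pretrained("metythorn/khmer-xlm-roberta-base")

inputs = tokenizer("ខ្ញុំរៀនភាសាខ្មែរ", return_tensors="pt")  # "I study the Khmer language"
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)

# Mean-pool token embeddings into a single sentence vector
embedding = hidden.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])

For the fine-tuning use case, the same checkpoint can be loaded with a task head instead, e.g. AutoModelForSequenceClassification.from_pretrained("metythorn/khmer-xlm-roberta-base", num_labels=2) for a hypothetical binary classification task.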

Limitations

  • Primarily trained on Khmer text patterns
  • May not handle code-switching effectively
  • Performance may vary between formal and informal Khmer
  • Limited exposure to technical or domain-specific vocabulary

Training Data

The model was trained on a custom corpus of Khmer and English text (31M examples, roughly 6 billion characters) drawn from varied sources to ensure broad language coverage.

Evaluation

Use this model for masked language modeling evaluation:

from transformers import pipeline

# Load model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Test examples (English glosses in comments)
test_sentences = [
    "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត",                # "Cambodia has <mask> provinces"
    "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា",  # "Phnom Penh is the <mask> of Cambodia"
    "ខ្ញុំចង់<mask>សៀវភៅ",                        # "I want to <mask> a book"
]

for sentence in test_sentences:
    result = fill_mask(sentence)
    print(f"Input: {sentence}")
    print(f"Top prediction: {result[0]['token_str']}")
    print("---")

Citation

If you use this model in your research, please cite:

@misc{xlm-roberta-khmer,
  title={XLM-RoBERTa Khmer Masked Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/metythorn/khmer-xlm-roberta-base}
}