XLM-RoBERTa Khmer Masked Language Model

This is a pretrained language model using the XLM-RoBERTa architecture for Khmer and English, trained for masked language modeling. In informal evaluations, this model performs better than the original FacebookAI/xlm-roberta-base on Khmer MLM tasks.

Model Details

  • Model Type: XLM-RoBERTa for Masked Language Modeling
  • Language: Khmer (km)
  • Base Model: xlm-roberta-base
  • Training Data: Khmer & English dataset with 84M examples
  • Parameters: 93,733,648 trainable parameters
  • Training Steps: 1,122,978
  • Final Checkpoint: Step 358500
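
As a quick sanity check, the reported parameter count can be reproduced by loading the checkpoint and summing trainable parameters. A minimal sketch, assuming transformers and torch are installed:

from transformers import AutoModelForMaskedLM

# Load the checkpoint and count trainable parameters
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")  # 93,733,648 per the card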

Training Details

  • Training Examples: 84 million examples (approximately 8.2 GB)
  • Epochs: 3
  • Batch Size: 8 (with DataParallel)
  • Gradient Accumulation: 1
  • Total Optimization Steps: 1,122,978
  • Learning Rate: ~2e-5 (with scheduler)
  • Hardware and Training Time: 4 GPUs, approximately 2 days of training
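
For reference, a comparable MLM run can be assembled with the Hugging Face Trainer. This is a sketch under the hyperparameters listed above, not the actual training script; the corpus file and tokenization settings are assumptions:

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Hypothetical corpus file; the card's Khmer & English dataset is not published
dataset = load_dataset("text", data_files={"train": "khmer_english_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard MLM collator: randomly masks 15% of tokens
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="khmer-xlm-roberta-base",
    per_device_train_batch_size=8,  # matches the card
    gradient_accumulation_steps=1,
    learning_rate=2e-5,             # decayed by the default linear scheduler
    num_train_epochs=3,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()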

Training Metrics

  • Final Training Loss: 1.5163
  • Final Learning Rate: 6.61e-06
  • Final Gradient Norm: 2.9005
  • Training Epoch: 66.94
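
For context, the final training loss corresponds to a perplexity over masked tokens of roughly exp(1.5163) ≈ 4.56, assuming it is the standard per-token cross-entropy:

import math

# Cross-entropy loss -> perplexity on masked positions
print(math.exp(1.5163))  # ≈ 4.56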

Usage

Fill-Mask Pipeline

from transformers import pipeline

# Load the model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Example usage
result = fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ")
print(result)
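
The pipeline returns a list of candidate fills ranked by score; continuing from the block above, top_k controls how many candidates are returned:

# Inspect the top 5 candidates and their scores
for pred in fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ", top_k=5):
    print(pred["token_str"], round(pred["score"], 4))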

Direct Model Usage

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Example usage
text = "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the masked token
with torch.no_grad():
    outputs = model(**inputs)

# Locate the <mask> position and decode the highest-scoring token
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_token_id = outputs.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(top_token_id))

Intended Use

This model is designed for:

  • Fill-mask tasks in Khmer language
  • Feature extraction for Khmer text (see the sketch after this list)
  • Fine-tuning on downstream Khmer NLP tasks
  • Research in Khmer language understanding
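
For feature extraction, the encoder can be loaded via AutoModel and its hidden states pooled into sentence vectors. A minimal sketch; the mean-pooling strategy here is one common choice, not something prescribed by the card:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
encoder = AutoModel.from_pretrained("metythorn/khmer-xlm-roberta-base")

inputs = tokenizer("ខ្ញុំចង់រៀនភាសាខ្មែរ", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)

# Mean-pool token embeddings into a single sentence vector
sentence_vector = hidden.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])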

Limitations

  • Primarily trained on Khmer text patterns
  • May not handle code-switching effectively
  • Performance may vary between formal and informal Khmer
  • Limited exposure to technical or domain-specific vocabulary

Training Data

The model was trained on a custom Khmer and English dataset drawing on a variety of text sources to ensure broad language coverage.

Evaluation

To evaluate the model on masked language modeling, run the fill-mask pipeline over a set of test sentences:

from transformers import pipeline

# Load model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Test examples
test_sentences = [
    "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត",
    "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា",
    "ខ្ញុំចង់<mask>សៀវភៅ"
]

for sentence in test_sentences:
    result = fill_mask(sentence)
    print(f"Input: {sentence}")
    print(f"Top prediction: {result[0]['token_str']}")
    print("---")

Citation

If you use this model in your research, please cite:

@misc{xlm-roberta-khmer,
  title={XLM-RoBERTa Khmer Masked Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/metythorn/khmer-xlm-roberta-base}
}