XLM-RoBERTa Khmer Masked Language Model

This is a pretrained language model based on the XLM-RoBERTa architecture for Khmer and English, trained for masked language modeling.

Model Details

  • Model Type: XLM-RoBERTa for Masked Language Modeling
  • Languages: Khmer (km) and English (en)
  • Base Model: xlm-roberta-base
  • Training Data: Khmer and English dataset of 31M examples (~6 billion characters in total)
  • Parameters: 163M trainable parameters
  • Training Steps: 1,122,978
  • Final Checkpoint: Step 1,950,500

Training Details

  • Training Examples: 31 million examples (approximately 8 GB of text)
  • Epochs: 100
  • Batch Size: 16 (with DataParallel)
  • Gradient Accumulation: 1
  • Total Optimization Steps: 14,509,200
  • Learning Rate: ~2e-5 (with scheduler)
  • Hardware: Single server with 4 GPUs
  • Training Time: approximately 10 days
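
The training script itself is not published. As a rough illustration only, a masked-language-modeling run with the hyperparameters above could be set up with the Hugging Face Trainer along these lines (the corpus file, sequence length, masking rate, and save interval are placeholders, not the actual configuration):

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# "khmer_corpus.txt" is a placeholder; the actual training corpus is not released
dataset = load_dataset("text", data_files={"train": "khmer_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard RoBERTa-style objective: randomly mask 15% of tokens
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="khmer-xlm-roberta-base",
    per_device_train_batch_size=16,   # batch size listed above
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    num_train_epochs=100,
    save_steps=10_000,                # placeholder save interval
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()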

Training Metrics

  • Final Training Loss: 2.3641
  • Final Learning Rate: 1.73e-05
  • Final Gradient Norm: 5.9456
  • Epoch at Final Checkpoint: 13.44

Usage

Fill-Mask Pipeline

from transformers import pipeline

# Load the model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Example usage: "I want to <mask> the Khmer language"
result = fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ")
print(result)
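
Each entry in result is a dict with score, token, token_str, and sequence keys; pass top_k to the call (e.g. fill_mask(text, top_k=5)) to control how many candidates are returned.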

Direct Model Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Example usage: "I want to <mask> the Khmer language"
text = "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the masked token and decode the top candidate
with torch.no_grad():
    outputs = model(**inputs)
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_token_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(top_token_id))

Intended Use

This model is designed for:

  • Fill-mask tasks in Khmer language
  • Feature extraction for Khmer text (see the sketch after this list)
  • Fine-tuning on downstream Khmer NLP tasks
  • Research in Khmer language understanding
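
As an illustration of the feature-extraction use case, a minimal sketch (mean pooling is a common default here, not something this card prescribes):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
encoder = AutoModel.from_pretrained("metythorn/khmer-xlm-roberta-base")

inputs = tokenizer("ខ្ញុំរៀនភាសាខ្មែរ", return_tensors="pt")  # "I study the Khmer language"
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)

# Mean-pool token embeddings into a single sentence vector
embedding = hidden.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])

For the fine-tuning use case, the same checkpoint can be loaded with a task head instead, e.g. AutoModelForSequenceClassification.from_pretrained("metythorn/khmer-xlm-roberta-base", num_labels=2) for a hypothetical binary classification task.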

Limitations

  • Primarily trained on Khmer text patterns
  • May not handle code-switching effectively
  • Performance may vary between formal and informal Khmer
  • Limited exposure to technical or domain-specific vocabulary

Training Data

The model was trained on a custom corpus of Khmer and English text (31M examples, roughly 6 billion characters) drawn from varied sources to ensure broad language coverage.

Evaluation

Use this model for masked language modeling evaluation:

from transformers import pipeline

# Load model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Test examples (English glosses in comments)
test_sentences = [
    "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត",                # "Cambodia has <mask> provinces"
    "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា",  # "Phnom Penh is the <mask> of Cambodia"
    "ខ្ញុំចង់<mask>សៀវភៅ",                        # "I want to <mask> a book"
]

for sentence in test_sentences:
    result = fill_mask(sentence)
    print(f"Input: {sentence}")
    print(f"Top prediction: {result[0]['token_str']}")
    print("---")

Citation

If you use this model in your research, please cite:

@misc{xlm-roberta-khmer,
  title={XLM-RoBERTa Khmer Masked Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/metythorn/khmer-xlm-roberta-base}
}