---
language:
- km
license: apache-2.0
tags:
- xlm-roberta
- khmer
- masked-lm
- fill-mask
- pytorch
- transformers
widget:
- text: "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
- text: "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត"
- text: "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា"
metrics:
- perplexity
base_model: xlm-roberta-base
pipeline_tag: fill-mask
---

# XLM-RoBERTa Khmer Masked Language Model

This is a pretrained language model based on the XLM-RoBERTa architecture for Khmer and English, trained for masked language modeling (MLM). In informal evaluations, it performs better than the original FacebookAI/xlm-roberta-base on Khmer MLM tasks.

## Model Details

- **Model Type**: XLM-RoBERTa for Masked Language Modeling
- **Language**: Khmer (km)
- **Base Model**: xlm-roberta-base
- **Training Data**: Khmer and English dataset with 84M examples
- **Parameters**: 93,733,648 trainable parameters
- **Training Steps**: 1,122,978
- **Final Checkpoint**: Step 358,500

## Training Details

- **Training Examples**: 84 million examples, approximately 8.2 GB
- **Epochs**: 3
- **Batch Size**: 8 (with DataParallel)
- **Gradient Accumulation**: 1
- **Total Optimization Steps**: 1,122,978
- **Learning Rate**: ~2e-5 (with scheduler)
- **Hardware and Training Time**: 4 GPUs, approximately 2 days of training

## Training Metrics

- **Final Training Loss**: 1.5163
- **Final Learning Rate**: 6.61e-06
- **Final Gradient Norm**: 2.9005
- **Training Epoch**: 66.94

## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

# Load the model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Example usage -- the input must contain the tokenizer's mask token
result = fill_mask("ខ្ញុំចង់<mask>ភាសាខ្មែរ")
print(result)
```

### Direct Model Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("metythorn/khmer-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Example usage -- the input must contain the mask token
text = "ខ្ញុំចង់<mask>ភាសាខ្មែរ"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the masked token
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits

# Decode the highest-scoring token at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = predictions[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

## Intended Use

This model is designed for:
- **Fill-mask tasks** in the Khmer language
- **Feature extraction** for Khmer text
- **Fine-tuning** on downstream Khmer NLP tasks
- **Research** in Khmer language understanding

## Limitations

- Primarily trained on Khmer text patterns
- May not handle code-switching effectively
- Performance may vary between formal and informal Khmer
- Limited exposure to technical or domain-specific vocabulary

## Training Data

The model was trained on a custom Khmer dataset containing various text sources to ensure broad language coverage.

## Evaluation

Use this model for masked language modeling evaluation:

```python
from transformers import pipeline

# Load model
fill_mask = pipeline("fill-mask", model="metythorn/khmer-xlm-roberta-base")

# Test examples -- each must contain a mask token
test_sentences = [
    "ប្រទេសកម្ពុជាមាន<mask>ខេត្ត",
    "រាជធានីភ្នំពេញគឺជ<mask>របស់ប្រទេសកម្ពុជា",
    "ខ្ញុំចង់<mask>សៀវភៅ",
]

for sentence in test_sentences:
    result = fill_mask(sentence)
    print(f"Input: {sentence}")
    print(f"Top prediction: {result[0]['token_str']}")
    print("---")
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{xlm-roberta-khmer,
  title={XLM-RoBERTa Khmer Masked Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/metythorn/khmer-xlm-roberta-base}
}
```
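The card lists perplexity as its evaluation metric but never shows how it relates to the reported loss. As a quick sketch (assuming the final training loss of 1.5163 is a mean cross-entropy in nats, the convention in Hugging Face training loops), training perplexity is simply the exponential of that loss:

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
# 1.5163 is the final training loss reported on this card.
final_loss = 1.5163
perplexity = math.exp(final_loss)
print(round(perplexity, 2))  # roughly 4.56
```

By this reading the model's training perplexity is roughly 4.6; a held-out perplexity would require running the same computation over the loss on an evaluation set.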