---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---
# Khmer Homophone Corrector
A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.
## Model Description
- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)
## Intended Uses & Limitations
### Intended Use Cases
- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content
### Limitations
- **Language Specific:** Only works with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset
## Training and Evaluation Data
### Training Data
- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs (an illustrative layout follows below)
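The exact schema of the dataset is not published; an illustrative layout for such input-target pairs might look like this (field names are assumptions):
```json
[
  {
    "input": "<Khmer sentence containing a homophone error>",
    "target": "<the same sentence with the correct homophone>"
  }
]
```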
### Evaluation Data
- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation (a scoring sketch follows below)
- **Validation:** Cross-validation on homophone groups
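BLEU and WER can be computed with standard tooling; here is a minimal scoring sketch using `sacrebleu` and `jiwer`, assumed as convenient off-the-shelf implementations rather than the exact evaluation scripts used:
```python
import sacrebleu
from jiwer import wer

# One hypothesis per test sentence, one reference stream of gold corrections
hypotheses = ["model output for sentence 1"]
references = [["gold corrected sentence 1"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)       # corpus BLEU (BLEU-4 by convention)
print(bleu.precisions)  # 1- to 4-gram precisions
print(wer(references[0][0], hypotheses[0]))  # word error rate
```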
### Data Preprocessing
1. **Word Segmentation:** Using Khmer word tokenization (`khmernltk.word_tokenize` from the khmer-nltk package)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Adding `</s> <2km>` for input and `<2km> ... </s>` for target
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Max length 128 tokens with padding (a preprocessing sketch follows below)
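A minimal sketch of steps 1-3, assuming the khmer-nltk segmenter named above (the actual training script is not published, and the helper name is hypothetical):
```python
from khmernltk import word_tokenize  # khmer-nltk word segmenter

def make_pair(raw_input: str, raw_target: str) -> dict:
    # Step 1: segment Khmer text into space-separated words
    src = " ".join(word_tokenize(raw_input))
    tgt = " ".join(word_tokenize(raw_target))
    # Steps 2-3: attach the PrahokBART control tokens
    return {
        "input": f"{src} </s> <2km>",
        "target": f"<2km> {tgt} </s>",
    }
```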
## Training Results
### Performance Metrics
- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525
### Training Analysis
The model demonstrates exceptional performance and training characteristics:
- **Rapid Convergence:** Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- **Outstanding Accuracy:** Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
- **Minimal Error Rate:** WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
- **No Overfitting:** The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
- **Early Performance:** Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction
### Training Configuration
- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management (a configuration sketch follows below)
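For reference, a minimal sketch of these settings using the Hugging Face `Seq2SeqTrainingArguments` (the output directory is a placeholder; the exact training script is not published):
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-homophone-corrector",  # hypothetical output path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,          # AdamW is the Trainer's default optimizer
    fp16=True,                  # mixed precision
    eval_strategy="epoch",      # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    save_total_limit=2,         # keep only 2 checkpoints
)
```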
## Usage
### Basic Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch
# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
# Example: substitute a Khmer sentence containing a homophone error
text = "..."
# Preprocess text (word segmentation)
from khmernltk import word_tokenize
segmented_text = " ".join(word_tokenize(text))
# Prepare input
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)
# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0
    )
# Decode output
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()  # "▁" is the SentencePiece word marker
print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: the input sentence with the homophone corrected
```
### Using with Streamlit
```python
import streamlit as st
import torch
from khmernltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model once; st.cache_resource reuses it across reruns
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input:
    # Segment, format, correct, and clean up as in the basic usage example
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5,
                                 early_stopping=True, forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()
    st.success(corrected)
```
## Model Architecture
- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing (a quick token-id check follows below)
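As a quick sanity check of the tokenization, the control tokens can be inspected directly; this assumes their ids match the `forced_bos_token_id`/`forced_eos_token_id` values used in the usage example above:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
# ids of the control tokens used as forced BOS/EOS during generation
print(tok.convert_tokens_to_ids(["<2km>", "</s>"]))
```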
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2025},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```
## Related Research
This model builds upon and fine-tunes the PrahokBART model:
**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)
## Acknowledgments
- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools
---
**Note:** This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks. |