---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---
# Khmer Homophone Corrector
A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.
## Model Description
- Developed by: Socheata Sokhachan
- Model type: PrahokBART (fine-tuned for homophone correction)
- Base Model: PrahokBART
- Language: Khmer (km)
- Repository: GitHub
- Live Demo: Streamlit App
## Intended Uses & Limitations

### Intended Use Cases
- Homophone Correction: Correcting commonly confused Khmer homophones in text
- Educational Applications: Helping students learn proper Khmer spelling
- Text Preprocessing: Improving text quality for downstream Khmer NLP tasks
- Content Creation: Assisting writers in producing error-free Khmer content
### Limitations
- Language Specific: Only works with Khmer text
- Homophone Focus: Designed specifically for homophone correction, not general grammar or spelling
- Context Dependency: May require surrounding context for optimal corrections
- Training Data Scope: Limited to the homophone pairs in the training dataset
## Training and Evaluation Data

### Training Data
- Dataset: Custom Khmer homophone dataset
- Size: 268+ homophone groups
- Coverage: Common Khmer homophones across different word categories
- Preprocessing: Word segmentation using Khmer NLP tools
- Format: JSON with input-target pairs (see the illustrative sample below)
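The dataset itself is not published, so the exact schema is unknown; a minimal illustrative record, assuming plain `input`/`target` fields over word-segmented text (the field names are hypothetical):

```python
# Hypothetical record layout; the "input"/"target" field names are an
# assumption, not the published schema.
example_pair = {
    "input": "…",   # word-segmented sentence containing a homophone error
    "target": "…",  # the same sentence with the homophone corrected
}
```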
### Evaluation Data
- Test Set: Homophone pairs not seen during training
- Metrics: BLEU score, WER, and human evaluation
- Validation: Cross-validation on homophone groups
### Data Preprocessing
- Word Segmentation: Using Khmer word tokenization (`khmer_nltk.word_tokenize`)
- Text Normalization: Standardizing text format with special tokens
- Special Tokens: Adding `</s> <2km>` to the input and `<2km> ... </s>` to the target
- Sequence Format: Converting to sequence-to-sequence format
- Padding: Max length 128 tokens with padding (see the sketch below)
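The training script is not published; a minimal sketch of this preprocessing pipeline, assuming the `khmer_nltk` tokenizer and the special-token format listed above:

```python
from khmer_nltk import word_tokenize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")

def build_pair(raw_input: str, raw_target: str):
    # Word-segment both sides and join the tokens with spaces
    src = " ".join(word_tokenize(raw_input))
    tgt = " ".join(word_tokenize(raw_target))
    # Attach the task tokens: "</s> <2km>" on the input,
    # "<2km> ... </s>" on the target
    src = f"{src} </s> <2km>"
    tgt = f"<2km> {tgt} </s>"
    # Convert to padded/truncated tensors (max length 128)
    model_inputs = tokenizer(src, max_length=128, padding="max_length",
                             truncation=True, return_tensors="pt")
    labels = tokenizer(tgt, max_length=128, padding="max_length",
                       truncation=True, return_tensors="pt")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```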
## Training Results

### Performance Metrics
- BLEU-1 Score: 99.5398
- BLEU-2 Score: 99.162
- BLEU-3 Score: 98.8093
- BLEU-4 Score: 98.4861
- WER (Word Error Rate): 0.008
- Human Evaluation Score: 0.008
- Final Training Loss: 0.0091
- Final Validation Loss: 0.023525
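The card does not state which tools produced these numbers; a minimal reproduction sketch, assuming `nltk` for cumulative BLEU-n and `jiwer` for WER over word-segmented text:

```python
from nltk.translate.bleu_score import corpus_bleu
from jiwer import wer

# Word-segmented reference corrections and model outputs (placeholders)
references = ["w1 w2 w3 w4", "w5 w6 w7"]
hypotheses = ["w1 w2 w3 w4", "w5 w6 w7"]

list_of_references = [[r.split()] for r in references]  # one reference each
hypothesis_tokens = [h.split() for h in hypotheses]

# Cumulative BLEU-1 .. BLEU-4
for n in range(1, 5):
    weights = tuple([1.0 / n] * n + [0.0] * (4 - n))
    score = corpus_bleu(list_of_references, hypothesis_tokens, weights=weights)
    print(f"BLEU-{n}: {score * 100:.2f}")

# Word error rate over the whole test set
print(f"WER: {wer(references, hypotheses):.3f}")
```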
### Training Analysis
The model demonstrates exceptional performance and training characteristics:
- Rapid Convergence: Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
- Stable Validation: Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- Outstanding Accuracy: Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
- Minimal Error Rate: WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
- No Overfitting: The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
- Early Performance: Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction
## Training Configuration
- Base Model: PrahokBART (from nict-astrec-att/prahokbart_big)
- Model Architecture: PrahokBART (Khmer-specific BART variant)
- Training Framework: Hugging Face Transformers
- Optimizer: AdamW
- Learning Rate: 3e-5
- Batch Size: 32 (per device)
- Training Epochs: 40
- Warmup Ratio: 0.1
- Weight Decay: 0.01
- Mixed Precision: FP16 enabled
- Evaluation Strategy: Every epoch
- Save Strategy: Every epoch (best 2 checkpoints)
- Max Sequence Length: 128 tokens
- Resume Training: Supported with checkpoint management
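A sketch of how this configuration maps onto Hugging Face `Seq2SeqTrainingArguments`; the original training script is not published, `output_dir` is a placeholder, and AdamW is the Trainer's default optimizer:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the configuration listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-homophone-corrector",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,                     # mixed precision
    evaluation_strategy="epoch",   # evaluate every epoch
    save_strategy="epoch",         # checkpoint every epoch
    save_total_limit=2,            # keep the best 2 checkpoints
)
```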
## Usage

### Basic Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize
import torch

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones
text = "αααα»ααααααααΌαααααα·ααααΆααα"  # Input with homophone error

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input with the task tokens
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True,
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0,
        temperature=1.0,
    )

# Decode output, then strip the task tokens and SentencePiece "▁" markers
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αααα»ααααα»ααα ααααα·ααααΆααα
```
### Using with Streamlit

```python
import streamlit as st
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input.strip():
    # Process text and display results (same steps as Basic Usage above)
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()
    st.success(corrected)
```
## Model Architecture
- Base Model: PrahokBART (Khmer-specific BART variant)
- Architecture: Sequence-to-Sequence Transformer
- Max Sequence Length: 128 tokens
- Special Features: Khmer word segmentation and normalization
- Tokenization: SentencePiece with Khmer-specific preprocessing
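Since the checkpoint loads through the `transformers` mBART classes (see Usage above), these architecture details can be inspected from its configuration; a small sketch using standard config attributes:

```python
from transformers import AutoConfig

# Standard MBartConfig attributes; values come from the hosted checkpoint
config = AutoConfig.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
print(config.model_type)                             # e.g. "mbart"
print(config.encoder_layers, config.decoder_layers)  # transformer depth
print(config.d_model, config.vocab_size)             # hidden size, vocabulary
```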
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2024},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```
## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: https://aclanthology.org/2025.coling-main.87.pdf
- Base Model: https://huggingface.co/nict-astrec-att/prahokbart_big
## Acknowledgments
- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools
**Note**: This model is designed specifically for Khmer homophone correction and may not work optimally with other languages or tasks.