Khmer Homophone Corrector

Model Description

A fine-tuned PrahokBART model designed to correct homophones in Khmer text. It builds on PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses characteristic challenges of Khmer language processing, including the lack of explicit word boundaries and frequent homophone confusion.

Intended Uses & Limitations

Intended Use Cases

  • Homophone Correction: Correcting commonly confused Khmer homophones in text
  • Educational Applications: Helping students learn proper Khmer spelling
  • Text Preprocessing: Improving text quality for downstream Khmer NLP tasks
  • Content Creation: Assisting writers in producing error-free Khmer content

Limitations

  • Language Specific: Only works with Khmer text
  • Homophone Focus: Designed specifically for homophone correction, not general grammar or spelling
  • Context Dependency: May require surrounding context for optimal corrections
  • Training Data Scope: Limited to the homophone pairs in the training dataset

Training and Evaluation Data

Training Data

  • Dataset: Custom Khmer homophone dataset
  • Size: 268+ homophone groups
  • Coverage: Common Khmer homophones across different word categories
  • Preprocessing: Word segmentation using Khmer NLP tools
  • Format: JSON with input-target pairs
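
An illustrative record in this format, reusing the example sentence from the usage section below; the field names are assumptions, since the card states only that the data is JSON with input-target pairs:

# Hypothetical record structure; actual field names may differ
example = {
    "input": "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™",   # sentence containing homophone errors
    "target": "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"   # corrected sentence
}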

Evaluation Data

  • Test Set: Homophone pairs not seen during training
  • Metrics: BLEU score, WER, and human evaluation
  • Validation: Cross-validation on homophone groups
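
These metrics can be recomputed with standard tooling. A minimal sketch, assuming the sacrebleu and jiwer packages (not named in this card) and a single illustrative sentence pair:

from khmernltk import word_tokenize  # pip install khmer-nltk
import sacrebleu
from jiwer import wer

# Illustrative pair taken from the usage example below
hyp = " ".join(word_tokenize("αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"))  # model output
ref = " ".join(word_tokenize("αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"))  # gold correction

# Corpus-level BLEU; .precisions holds the BLEU-1..BLEU-4 n-gram precisions
bleu = sacrebleu.corpus_bleu([hyp], [[ref]])
print(bleu.score, bleu.precisions)

# Word Error Rate over the segmented words
print(wer(ref, hyp))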

Data Preprocessing

  1. Word Segmentation: Using Khmer word tokenization (word_tokenize from the khmer-nltk package)
  2. Text Normalization: Standardizing text format with special tokens
  3. Special Tokens: Adding </s> <2km> for input and <2km> ... </s> for target
  4. Sequence Format: Converting to sequence-to-sequence format
  5. Padding: Max length 128 tokens with padding
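
A minimal sketch of steps 1-5 above, assuming the khmer-nltk and transformers packages; the exact padding strategy is an assumption based on the 128-token limit:

from khmernltk import word_tokenize  # pip install khmer-nltk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")

def preprocess_pair(input_text, target_text, max_length=128):
    # 1-2. Word segmentation and normalization to space-joined tokens
    src = " ".join(word_tokenize(input_text))
    tgt = " ".join(word_tokenize(target_text))
    # 3. Special tokens: "</s> <2km>" after the input, "<2km> ... </s>" around the target
    src = f"{src} </s> <2km>"
    tgt = f"<2km> {tgt} </s>"
    # 4-5. Convert to fixed-length tensors, padding/truncating to 128 tokens
    model_inputs = tokenizer(src, max_length=max_length, padding="max_length",
                             truncation=True, return_tensors="pt")
    model_inputs["labels"] = tokenizer(tgt, max_length=max_length, padding="max_length",
                                       truncation=True, return_tensors="pt").input_ids
    return model_inputs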

Training Results

Performance Metrics

  • BLEU-1 Score: 99.5398
  • BLEU-2 Score: 99.162
  • BLEU-3 Score: 98.8093
  • BLEU-4 Score: 98.4861
  • WER (Word Error Rate): 0.008
  • Human Evaluation Score: 0.008
  • Final Training Loss: 0.0091
  • Final Validation Loss: 0.023525

Training Analysis

The model demonstrates exceptional accuracy and healthy training dynamics:

  • Rapid Convergence: Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
  • Stable Validation: Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
  • Outstanding Accuracy: Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
  • Minimal Error Rate: WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
  • No Overfitting: The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
  • Early Performance: Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction

Training Configuration

  • Base Model: PrahokBART (from nict-astrec-att/prahokbart_big)
  • Model Architecture: PrahokBART (Khmer-specific BART variant)
  • Training Framework: Hugging Face Transformers
  • Optimizer: AdamW
  • Learning Rate: 3e-5
  • Batch Size: 32 (per device)
  • Training Epochs: 40
  • Warmup Ratio: 0.1
  • Weight Decay: 0.01
  • Mixed Precision: FP16 enabled
  • Evaluation Strategy: Every epoch
  • Save Strategy: Every epoch (best 2 checkpoints)
  • Max Sequence Length: 128 tokens
  • Resume Training: Supported with checkpoint management
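
For concreteness, a sketch of how this configuration maps onto Hugging Face Seq2SeqTrainingArguments; output_dir is a placeholder, and this is an illustration, not the author's training script:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-homophone-corrector",  # placeholder path
    learning_rate=3e-5,                      # AdamW is the Trainer default optimizer
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,                               # mixed precision
    evaluation_strategy="epoch",             # evaluate every epoch
    save_strategy="epoch",                   # checkpoint every epoch
    save_total_limit=2,                      # keep at most 2 checkpoints
    load_best_model_at_end=True,
)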

Usage

Basic Usage

from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones
text = "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"  # Input with homophone error

# Preprocess text (word segmentation)
from khmernltk import word_tokenize  # pip install khmer-nltk
segmented_text = " ".join(word_tokenize(text))

# Prepare input
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,  # matches the model's 128-token training length
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=128,  # matches the model's 128-token training length
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0
    )

# Decode output
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™
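
Building on the variables defined above, a convenience sketch (not part of the original card) for correcting several sentences in one batch:

def correct_batch(texts):
    # Segment and format each sentence, then pad to a common length
    formatted = [" ".join(word_tokenize(t)) + " </s> <2km>" for t in texts]
    enc = tokenizer(formatted, return_tensors="pt", padding=True,
                    truncation=True, max_length=128).to(device)
    with torch.no_grad():
        out = model.generate(**enc, max_length=128, num_beams=5,
                             early_stopping=True,
                             forced_bos_token_id=32000,
                             forced_eos_token_id=32001)
    decoded = tokenizer.batch_decode(out, skip_special_tokens=True)
    return [d.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()
            for d in decoded]

print(correct_batch([text]))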

Using with Streamlit

import streamlit as st
import torch
from khmernltk import word_tokenize  # pip install khmer-nltk
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    model.eval()
    return model, tokenizer

# Load model (cached across Streamlit reruns)
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input.strip():
    # Segment, format, generate, and clean up (see the basic usage example)
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128, num_beams=5,
                                 early_stopping=True,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()
    st.success(corrected)
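
To try the interface locally, save the script as app.py (any name works) and start it with streamlit run app.py.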

Model Architecture

  • Base Model: PrahokBART (Khmer-specific BART variant)
  • Architecture: Sequence-to-Sequence Transformer
  • Max Sequence Length: 128 tokens
  • Special Features: Khmer word segmentation and normalization
  • Tokenization: SentencePiece with Khmer-specific preprocessing
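
A quick way to inspect what was loaded, assuming only the transformers package; the attributes shown are standard mBART config fields:

from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
print(model.config.d_model, model.config.encoder_layers, model.config.decoder_layers)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")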

Citation

If you use this model in your research, please cite:

@misc{sokhachan2024khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2024},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}

Related Research

This model builds upon and fine-tunes the PrahokBART model:

PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

Acknowledgments

  • The PrahokBART research team for the base model
  • Hugging Face for the transformers library
  • The Khmer NLP community for language resources
  • Streamlit for the web framework
  • Contributors to the Khmer language processing tools

Note: This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks.
