---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---
# Khmer Homophone Corrector
A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.
## Model Description
- Developed by: Socheata Sokhachan
- Model type: PrahokBART (fine-tuned for homophone correction)
- Base Model: PrahokBART
- Language: Khmer (km)
- Repository: GitHub
- Live Demo: Streamlit App
## Intended Uses & Limitations

### Intended Use Cases
- Homophone Correction: Correcting commonly confused Khmer homophones in text
- Educational Applications: Helping students learn proper Khmer spelling
- Text Preprocessing: Improving text quality for downstream Khmer NLP tasks
- Content Creation: Assisting writers in producing error-free Khmer content
### Limitations
- Language Specific: Only works with Khmer text
- Homophone Focus: Designed specifically for homophone correction, not general grammar or spelling
- Context Dependency: May require surrounding context for optimal corrections
- Training Data Scope: Limited to the homophone pairs in the training dataset
## Training and Evaluation Data

### Training Data
- Dataset: Custom Khmer homophone dataset
- Size: 268+ homophone groups
- Coverage: Common Khmer homophones across different word categories
- Preprocessing: Word segmentation using Khmer NLP tools
- Format: JSON with input-target pairs (see the illustrative sample below)
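The dataset itself is not published, so the exact schema is unknown; a minimal illustrative record, assuming plain `input`/`target` fields over word-segmented text (the field names are hypothetical):

```python
# Hypothetical record layout; the "input"/"target" field names are an
# assumption, not the published schema.
example_pair = {
    "input": "…",   # word-segmented sentence containing a homophone error
    "target": "…",  # the same sentence with the homophone corrected
}
```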
### Evaluation Data
- Test Set: Homophone pairs not seen during training
- Metrics: BLEU score, WER, and human evaluation
- Validation: Cross-validation on homophone groups
### Data Preprocessing
- Word Segmentation: Using Khmer word tokenization (`khmer_nltk.word_tokenize`)
- Text Normalization: Standardizing text format with special tokens
- Special Tokens: Adding `</s> <2km>` to the input and `<2km> ... </s>` to the target
- Sequence Format: Converting to sequence-to-sequence format
- Padding: Max length 128 tokens with padding (see the sketch below)
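The training script is not published; a minimal sketch of this preprocessing pipeline, assuming the `khmer_nltk` tokenizer and the special-token format listed above:

```python
from khmer_nltk import word_tokenize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")

def build_pair(raw_input: str, raw_target: str):
    # Word-segment both sides and join the tokens with spaces
    src = " ".join(word_tokenize(raw_input))
    tgt = " ".join(word_tokenize(raw_target))
    # Attach the task tokens: "</s> <2km>" on the input,
    # "<2km> ... </s>" on the target
    src = f"{src} </s> <2km>"
    tgt = f"<2km> {tgt} </s>"
    # Convert to padded/truncated tensors (max length 128)
    model_inputs = tokenizer(src, max_length=128, padding="max_length",
                             truncation=True, return_tensors="pt")
    labels = tokenizer(tgt, max_length=128, padding="max_length",
                       truncation=True, return_tensors="pt")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```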
## Training Results

### Performance Metrics
- BLEU-1 Score: 99.5398
- BLEU-2 Score: 99.162
- BLEU-3 Score: 98.8093
- BLEU-4 Score: 98.4861
- WER (Word Error Rate): 0.008
- Human Evaluation Score: 0.008
- Final Training Loss: 0.0091
- Final Validation Loss: 0.023525
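The card does not state which tools produced these numbers; a minimal reproduction sketch, assuming `nltk` for cumulative BLEU-n and `jiwer` for WER over word-segmented text:

```python
from nltk.translate.bleu_score import corpus_bleu
from jiwer import wer

# Word-segmented reference corrections and model outputs (placeholders)
references = ["w1 w2 w3 w4", "w5 w6 w7"]
hypotheses = ["w1 w2 w3 w4", "w5 w6 w7"]

list_of_references = [[r.split()] for r in references]  # one reference each
hypothesis_tokens = [h.split() for h in hypotheses]

# Cumulative BLEU-1 .. BLEU-4
for n in range(1, 5):
    weights = tuple([1.0 / n] * n + [0.0] * (4 - n))
    score = corpus_bleu(list_of_references, hypothesis_tokens, weights=weights)
    print(f"BLEU-{n}: {score * 100:.2f}")

# Word error rate over the whole test set
print(f"WER: {wer(references, hypotheses):.3f}")
```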
### Training Analysis
The model demonstrates exceptional performance and training characteristics:
- Rapid Convergence: Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
- Stable Validation: Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- Outstanding Accuracy: Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
- Minimal Error Rate: WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
- No Overfitting: The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
- Early Performance: Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction
## Training Configuration
- Base Model: PrahokBART (from nict-astrec-att/prahokbart_big)
- Model Architecture: PrahokBART (Khmer-specific BART variant)
- Training Framework: Hugging Face Transformers
- Optimizer: AdamW
- Learning Rate: 3e-5
- Batch Size: 32 (per device)
- Training Epochs: 40
- Warmup Ratio: 0.1
- Weight Decay: 0.01
- Mixed Precision: FP16 enabled
- Evaluation Strategy: Every epoch
- Save Strategy: Every epoch (best 2 checkpoints)
- Max Sequence Length: 128 tokens
- Resume Training: Supported with checkpoint management
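A sketch of how this configuration maps onto Hugging Face `Seq2SeqTrainingArguments`; the original training script is not published, `output_dir` is a placeholder, and AdamW is the Trainer's default optimizer:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the configuration listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-homophone-corrector",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,                     # mixed precision
    evaluation_strategy="epoch",   # evaluate every epoch
    save_strategy="epoch",         # checkpoint every epoch
    save_total_limit=2,            # keep the best 2 checkpoints
)
```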
## Usage

### Basic Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize
import torch

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones
text = "αααα»ααααααααΌαααααα·ααααΆααα"  # Input with homophone error

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input with the task tokens
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True,
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0,
        temperature=1.0,
    )

# Decode output, then strip the task tokens and SentencePiece "▁" markers
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αααα»ααααα»ααα ααααα·ααααΆααα
```
### Using with Streamlit

```python
import streamlit as st
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input.strip():
    # Process text and display results (same steps as Basic Usage above)
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()
    st.success(corrected)
```
## Model Architecture
- Base Model: PrahokBART (Khmer-specific BART variant)
- Architecture: Sequence-to-Sequence Transformer
- Max Sequence Length: 128 tokens
- Special Features: Khmer word segmentation and normalization
- Tokenization: SentencePiece with Khmer-specific preprocessing
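Since the checkpoint loads through the `transformers` mBART classes (see Usage above), these architecture details can be inspected from its configuration; a small sketch using standard config attributes:

```python
from transformers import AutoConfig

# Standard MBartConfig attributes; values come from the hosted checkpoint
config = AutoConfig.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
print(config.model_type)                             # e.g. "mbart"
print(config.encoder_layers, config.decoder_layers)  # transformer depth
print(config.d_model, config.vocab_size)             # hidden size, vocabulary
```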
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2024},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```
## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: https://aclanthology.org/2025.coling-main.87.pdf
- Base Model: https://huggingface.co/nict-astrec-att/prahokbart_big
## Acknowledgments
- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools
**Note**: This model is designed specifically for Khmer homophone correction and may not work optimally with other languages or tasks.