---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---
# Khmer Homophone Corrector
A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.
## Model Description
- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)
## Intended Uses & Limitations
### Intended Use Cases
- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content
### Limitations
- **Language Specific:** Only works with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset
## Training and Evaluation Data
### Training Data
- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs
### Evaluation Data
- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation (a scoring sketch follows this list)
- **Validation:** Cross-validation on homophone groups
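The exact scoring scripts are not published with this card; the snippet below is a minimal sketch of how BLEU-1 through BLEU-4 and WER could be computed on word-segmented pairs, assuming the `nltk` and `jiwer` libraries purely for illustration.
```python
# Minimal scoring sketch (nltk and jiwer are assumptions, not stated dependencies
# of this model; the actual evaluation tooling is not specified in this card).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
import jiwer

# Word-segmented gold corrections and model outputs (illustrative examples)
references = ["αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"]
hypotheses = ["αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"]

refs_tok = [[r.split()] for r in references]  # corpus_bleu expects a list of reference lists
hyps_tok = [h.split() for h in hypotheses]
smooth = SmoothingFunction().method1

# BLEU-n: uniform weights over the first n n-gram orders
for n in range(1, 5):
    score = corpus_bleu(refs_tok, hyps_tok,
                        weights=tuple([1.0 / n] * n),
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score * 100:.2f}")

# Word error rate over the whole test set
print("WER:", jiwer.wer(references, hypotheses))
```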
### Data Preprocessing
1. **Word Segmentation:** Using Khmer word tokenization (`khmer_nltk.word_tokenize`)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Adding `</s> <2km>` for input and `<2km> ... </s>` for target
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Max length 128 tokens with padding (a sketch tying these steps together follows)
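The JSON input-target pairs can be turned into model-ready examples along these lines; this is a hedged sketch, and the helper `build_example` and the exact tokenizer settings are assumptions rather than the published training code.
```python
# Hedged sketch of converting one input-target pair into the seq2seq format
# described above. The actual preprocessing script is not included in this card,
# so build_example and the tokenizer settings are illustrative.
from khmer_nltk import word_tokenize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nict-astrec-att/prahokbart_big")

def build_example(input_text: str, target_text: str) -> dict:
    # 1-2. Word segmentation and text normalization
    src = " ".join(word_tokenize(input_text))
    tgt = " ".join(word_tokenize(target_text))
    # 3. Special tokens: "</s> <2km>" after the input, "<2km> ... </s>" around the target
    src = f"{src} </s> <2km>"
    tgt = f"<2km> {tgt} </s>"
    # 4-5. Seq2seq format, padded/truncated to a max length of 128 tokens
    model_inputs = tokenizer(src, max_length=128, padding="max_length", truncation=True)
    labels = tokenizer(tgt, max_length=128, padding="max_length", truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```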
## Training Results
### Performance Metrics
- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525
### Training Analysis
The model demonstrates strong performance and stable training behavior:
- **Rapid Convergence:** Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- **Outstanding Accuracy:** Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
- **Minimal Error Rate:** WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
- **No Overfitting:** The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
- **Early Performance:** Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction
### Training Configuration
- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management (see the configuration sketch below)
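For orientation, the sketch below shows roughly how this configuration maps onto Hugging Face `Seq2SeqTrainingArguments`. It is not the original training script: output paths and dataset variables are placeholders, and AdamW is simply the Trainer's default optimizer.
```python
# Hedged sketch: the configuration listed above expressed as Seq2SeqTrainingArguments.
# Paths and dataset variables are placeholders, not part of this repository.
from transformers import (AutoTokenizer, MBartForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = MBartForConditionalGeneration.from_pretrained("nict-astrec-att/prahokbart_big")
tokenizer = AutoTokenizer.from_pretrained("nict-astrec-att/prahokbart_big")

training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-homophone-corrector",  # placeholder output path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,
    evaluation_strategy="epoch",             # "eval_strategy" in newer transformers releases
    save_strategy="epoch",
    save_total_limit=2,                      # keep the best 2 checkpoints
    load_best_model_at_end=True,
)

train_dataset = eval_dataset = None  # placeholders: tokenized pairs built as in "Data Preprocessing"
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
# trainer.train()  # or trainer.train(resume_from_checkpoint=True) to resume from a saved checkpoint
```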
## Usage
### Basic Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch
# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
# Example text with homophones
text = "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™" # Input with homophone error
# Preprocess text (word segmentation)
from khmer_nltk import word_tokenize
segmented_text = " ".join(word_tokenize(text))
# Prepare input
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)
# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0,
        temperature=1.0
    )
# Decode output
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()
print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™
```
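The example assumes `torch`, `transformers`, and the `khmer-nltk` package (which provides `khmer_nltk.word_tokenize`) are installed, e.g. via `pip install torch transformers khmer-nltk`.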
### Using with Streamlit
```python
import streamlit as st
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model
model, tokenizer = load_model()
# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input.strip():
    # Process text and display results (same steps as the basic usage example above)
    segmented_text = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented_text} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5, early_stopping=True,
                                 forced_bos_token_id=32000, forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()
    st.success(corrected)
```
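To try the interface locally, save the script (for example as `app.py`) and launch it with `streamlit run app.py`; the hosted version is available at the live demo link above.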
## Model Architecture
- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{sokhachan2024khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2024},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```
## Related Research
This model builds upon and fine-tunes the PrahokBART model:
**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)
## Acknowledgments
- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools
---
**Note:** This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks.