---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---

# Khmer Homophone Corrector

A PrahokBART model fine-tuned to correct homophones in Khmer text. It builds on PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses challenges specific to Khmer processing, including word-boundary ambiguity and homophone confusion.

## Model Description

- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)

## Intended Uses & Limitations

### Intended Use Cases

- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content

### Limitations

- **Language Specific:** Works only with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling correction
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset

## Training and Evaluation Data

### Training Data

- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs

### Evaluation Data

- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation
- **Validation:** Cross-validation on homophone groups

### Data Preprocessing

1. **Word Segmentation:** Tokenize words with `khmer_nltk.word_tokenize`
2. **Text Normalization:** Standardize the text format with special tokens
3. **Special Tokens:** Append `</s> <2km>` to the input and wrap the target as `<2km> ... </s>`
4. **Sequence Format:** Convert to sequence-to-sequence input-target pairs
5. **Padding:** Pad and truncate to a maximum length of 128 tokens (combined into a single function in the sketch below)
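
A minimal sketch of those five steps, assuming `khmer_nltk` for segmentation and a loaded PrahokBART tokenizer; `preprocess_pair` is an illustrative helper, not code from the repository:

```python
from khmer_nltk import word_tokenize

def preprocess_pair(source: str, target: str, tokenizer, max_length: int = 128):
    # Step 1: word segmentation
    src = " ".join(word_tokenize(source))
    tgt = " ".join(word_tokenize(target))
    # Steps 2-3: attach the PrahokBART special tokens
    src = f"{src} </s> <2km>"
    tgt = f"<2km> {tgt} </s>"
    # Steps 4-5: convert to padded/truncated seq2seq features
    features = tokenizer(src, max_length=max_length,
                         padding="max_length", truncation=True)
    labels = tokenizer(tgt, max_length=max_length,
                       padding="max_length", truncation=True)
    features["labels"] = labels["input_ids"]
    return features
```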

## Training Results

### Performance Metrics

- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525

These scores can be recomputed with standard tooling, as sketched below.
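
The card does not name the exact evaluation libraries, so this sketch assumes NLTK's `corpus_bleu` for BLEU and the `jiwer` package for WER; `score` is an illustrative helper:

```python
from nltk.translate.bleu_score import corpus_bleu
from jiwer import wer

def score(references, hypotheses):
    # NLTK expects tokenized text: one list of reference token lists per sample
    refs = [[r.split()] for r in references]
    hyps = [h.split() for h in hypotheses]
    results = {}
    for n in range(1, 5):
        # Cumulative BLEU-n uses uniform weights over 1..n-grams
        weights = tuple(1.0 / n for _ in range(n))
        results[f"BLEU-{n}"] = corpus_bleu(refs, hyps, weights=weights) * 100
    results["WER"] = wer(references, hypotheses)  # jiwer accepts raw strings
    return results
```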

### Training Analysis

The model shows strong performance and stable training behavior:

- **Rapid Convergence:** Training loss dropped from 0.6786 in epoch 1 to 0.0091 in epoch 40
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization
- **High Accuracy:** BLEU-1 reached 99.54% and BLEU-4 98.49%, demonstrating near-perfect homophone correction on the test set
- **Minimal Error Rate:** A WER of 0.008 makes the model reliable for practical applications
- **No Overfitting:** The small, consistent gap between training loss (0.0091) and validation loss (0.0235) suggests good generalization
- **Early Performance:** The model reached its best BLEU scores and WER as early as epoch 1, reflecting the strength of the PrahokBART base model for this task

### Training Configuration

- **Base Model:** PrahokBART ([nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints kept)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported via checkpoint management

These settings map directly onto Hugging Face training arguments, as sketched below.
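
A minimal sketch assuming a recent Transformers release (older versions name `eval_strategy` as `evaluation_strategy`); the output directory is illustrative, and AdamW is the Trainer default, so it needs no explicit argument:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./khmer-homophone-corrector",  # illustrative path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,                 # mixed precision
    eval_strategy="epoch",     # evaluate every epoch
    save_strategy="epoch",     # checkpoint every epoch
    save_total_limit=2,        # keep the best 2 checkpoints
    load_best_model_at_end=True,
)
```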

## Usage

### Basic Usage

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize
import torch

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones (replace with your own Khmer input)
text = "..."  # a Khmer sentence containing a homophone error

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input with the PrahokBART special tokens
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0
    )

# Decode output and strip special tokens / SentencePiece markers
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
```

### Using with Streamlit

```python
import streamlit as st
import torch
from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct"):
    # Segment, add special tokens, generate, and display the correction
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    st.write(corrected.replace("▁", " ").strip())
```

## Model Architecture

- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing (see the quick check below)
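
A quick way to inspect the SentencePiece output, assuming the released tokenizer; the Khmer word កម្ពុជា ("Cambodia") is just an illustrative input:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
ids = tokenizer("αž€αž˜αŸ’αž–αž»αž‡αžΆ </s> <2km>")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # '▁'-prefixed SentencePiece pieces
```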

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2025},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```

## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)

## Acknowledgments

- The PrahokBART research team for the base model
- Hugging Face for the Transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to Khmer language processing tools

---

**Note:** This model is designed specifically for Khmer homophone correction and may not work well for other languages or tasks.