---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---
# Khmer Homophone Corrector
A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.
## Model Description
- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)
## Intended Uses & Limitations
### Intended Use Cases
- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content
### Limitations
- **Language Specific:** Only works with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset
## Training and Evaluation Data
### Training Data
- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs (an illustrative layout follows below)
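The exact schema of the dataset is not published; an illustrative layout for such input-target pairs might look like this (field names are assumptions):
```json
[
  {
    "input": "<Khmer sentence containing a homophone error>",
    "target": "<the same sentence with the correct homophone>"
  }
]
```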
### Evaluation Data
- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation (a scoring sketch follows below)
- **Validation:** Cross-validation on homophone groups
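BLEU and WER can be computed with standard tooling; here is a minimal scoring sketch using `sacrebleu` and `jiwer`, assumed as convenient off-the-shelf implementations rather than the exact evaluation scripts used:
```python
import sacrebleu
from jiwer import wer

# One hypothesis per test sentence, one reference stream of gold corrections
hypotheses = ["model output for sentence 1"]
references = [["gold corrected sentence 1"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)       # corpus BLEU (BLEU-4 by convention)
print(bleu.precisions)  # 1- to 4-gram precisions
print(wer(references[0][0], hypotheses[0]))  # word error rate
```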
### Data Preprocessing
1. **Word Segmentation:** Using Khmer word tokenization (`khmernltk.word_tokenize` from the khmer-nltk package)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Adding `</s> <2km>` for input and `<2km> ... </s>` for target
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Max length 128 tokens with padding (a preprocessing sketch follows below)
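A minimal sketch of steps 1-3, assuming the khmer-nltk segmenter named above (the actual training script is not published, and the helper name is hypothetical):
```python
from khmernltk import word_tokenize  # khmer-nltk word segmenter

def make_pair(raw_input: str, raw_target: str) -> dict:
    # Step 1: segment Khmer text into space-separated words
    src = " ".join(word_tokenize(raw_input))
    tgt = " ".join(word_tokenize(raw_target))
    # Steps 2-3: attach the PrahokBART control tokens
    return {
        "input": f"{src} </s> <2km>",
        "target": f"<2km> {tgt} </s>",
    }
```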
## Training Results
### Performance Metrics
- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525
### Training Analysis
The model demonstrates exceptional performance and training characteristics:
- **Rapid Convergence:** Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- **Outstanding Accuracy:** Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
- **Minimal Error Rate:** WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
- **No Overfitting:** The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
- **Early Performance:** Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction
### Training Configuration
- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management (a configuration sketch follows below)
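For reference, a minimal sketch of these settings using the Hugging Face `Seq2SeqTrainingArguments` (the output directory is a placeholder; the exact training script is not published):
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-homophone-corrector",  # hypothetical output path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,          # AdamW is the Trainer's default optimizer
    fp16=True,                  # mixed precision
    eval_strategy="epoch",      # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    save_total_limit=2,         # keep only 2 checkpoints
)
```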
## Usage
### Basic Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch
# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
# Example: substitute a Khmer sentence containing a homophone error
text = "..."
# Preprocess text (word segmentation)
from khmernltk import word_tokenize
segmented_text = " ".join(word_tokenize(text))
# Prepare input
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)
# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0
    )
# Decode output
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()  # "▁" is the SentencePiece word marker
print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: the input sentence with the homophone corrected
```
### Using with Streamlit
```python
import streamlit as st
import torch
from khmernltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model once; st.cache_resource reuses it across reruns
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input:
    # Segment, format, correct, and clean up as in the basic usage example
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5,
                                 early_stopping=True, forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()
    st.success(corrected)
```
## Model Architecture
- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing (a quick token-id check follows below)
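As a quick sanity check of the tokenization, the control tokens can be inspected directly; this assumes their ids match the `forced_bos_token_id`/`forced_eos_token_id` values used in the usage example above:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
# ids of the control tokens used as forced BOS/EOS during generation
print(tok.convert_tokens_to_ids(["<2km>", "</s>"]))
```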
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2025},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```
## Related Research
This model builds upon and fine-tunes the PrahokBART model:
**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)
## Acknowledgments
- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools
---
**Note:** This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks. |