---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---

# Khmer Homophone Corrector

A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.

## Model Description

- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)

## Intended Uses & Limitations

### Intended Use Cases
- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content

### Limitations
- **Language Specific:** Only works with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset

## Training and Evaluation Data

### Training Data
- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs

### Evaluation Data
- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation
- **Validation:** Cross-validation on homophone groups
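
The reported WER is word-level edit distance over segmented tokens. A minimal sketch of that computation (real evaluations typically use a library such as `jiwer`; this is illustrative only):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over whitespace-separated tokens,
    normalized by reference length. Assumes pre-segmented (spaced) Khmer text."""
    r, h = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / max(len(r), 1)
```

For example, `wer("αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž„αŸ‹", "αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„")` counts one substitution out of two reference words.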

### Data Preprocessing
1. **Word Segmentation:** Using Khmer word tokenization (`khmer_nltk.word_tokenize`)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Adding `</s> <2km>` for input and `<2km> ... </s>` for target
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Max length 128 tokens with padding
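
Steps 2–4 amount to simple string formatting once the text is segmented. A sketch, assuming segmentation (step 1) has already been done with `khmer_nltk.word_tokenize`:

```python
def make_seq2seq_pair(input_tokens, target_tokens):
    """Wrap pre-segmented input/target token lists in PrahokBART's special
    tokens, as described in steps 2-4 above."""
    source = " ".join(input_tokens) + " </s> <2km>"        # encoder input
    target = "<2km> " + " ".join(target_tokens) + " </s>"  # decoder target
    return source, target

# Example: input with a homophone error, target with the correction
src, tgt = make_seq2seq_pair(["αžαŸ’αž‰αž»αŸ†", "αž€αŸ†αž–αž„αŸ‹"], ["αžαŸ’αž‰αž»αŸ†", "αž€αŸ†αž–αž»αž„"])
```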

## Training Results

### Performance Metrics
- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525

### Training Analysis
The model demonstrates exceptional performance and training characteristics:

- **Rapid Convergence:** Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- **Outstanding Accuracy:** Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
- **Minimal Error Rate:** WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
- **No Overfitting:** The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
- **Early Performance:** Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction

### Training Configuration
- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management
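
Collected as keyword arguments, the configuration above maps onto Hugging Face's `Seq2SeqTrainingArguments` roughly as follows (argument names follow the Transformers API; the exact training script is not published here, so treat this as a sketch):

```python
# Hyperparameters from the list above; older Transformers versions spell
# eval_strategy as evaluation_strategy.
training_kwargs = dict(
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,                 # mixed precision
    eval_strategy="epoch",     # evaluate every epoch
    save_strategy="epoch",     # checkpoint every epoch
    save_total_limit=2,        # keep the best 2 checkpoints
)
# args = Seq2SeqTrainingArguments(output_dir="out", **training_kwargs)
```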

## Usage

### Basic Usage

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones
text = "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"  # Input with homophone error

# Preprocess text (word segmentation)
from khmer_nltk import word_tokenize
segmented_text = " ".join(word_tokenize(text))

# Prepare input
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0,
        temperature=1.0
    )

# Decode output
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™
```

### Using with Streamlit

```python
import streamlit as st
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input:
    # Same preprocessing and generation steps as in Basic Usage
    # (word segmentation omitted here for brevity)
    inputs = tokenizer(user_input + " </s> <2km>", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=1024, num_beams=5)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    st.write(corrected)
```

## Model Architecture

- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2025},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```

## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)

## Acknowledgments

- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools

---

**Note:** This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks.