Create README.md
README.md
ADDED
@@ -0,0 +1,213 @@
# Khmer Homophone Corrector

A fine-tuned PrahokBART model for correcting homophone errors in Khmer text. It builds on PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses two challenges of Khmer language processing: word boundary issues and homophone confusion.

## Model Description

- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **License:** MIT
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)

## Intended Uses & Limitations

### Intended Use Cases
- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content

### Limitations
- **Language Specific:** Only works with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset

## Training and Evaluation Data

### Training Data
- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs

### Evaluation Data
- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation
- **Validation:** Cross-validation on homophone groups

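The held-out test set and the cross-validation on homophone groups can be pictured as a split over groups rather than over individual sentences, so that no group contributes to both training and evaluation. The following is a minimal sketch under stated assumptions, not the pipeline used for this model: the `group_id` field, the fold count, and the helper name `group_k_fold` are all hypothetical.

```python
from collections import defaultdict


def group_k_fold(examples, k=5):
    """Yield (train, test) lists so that each homophone group lands in exactly one test fold.

    `examples` is a list of dicts carrying a hypothetical "group_id" key.
    """
    by_group = defaultdict(list)
    for example in examples:
        by_group[example["group_id"]].append(example)

    group_ids = sorted(by_group)  # deterministic order; shuffle with a fixed seed if preferred
    for fold in range(k):
        test_groups = set(group_ids[fold::k])  # every k-th group goes to this fold's test set
        train, test = [], []
        for gid in group_ids:
            (test if gid in test_groups else train).extend(by_group[gid])
        yield train, test


# Toy data: 4 groups with 2 examples each (placeholder ids, not real homophone groups)
examples = [
    {"group_id": g, "input": f"src-{g}-{i}", "target": f"tgt-{g}-{i}"}
    for g in range(4)
    for i in range(2)
]
folds = list(group_k_fold(examples, k=2))
```

Splitting by group keeps every sentence of a homophone group on one side of the split, which matches the "not seen during training" wording above.
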
### Data Preprocessing
1. **Word Segmentation:** Segmenting text into words with Khmer word tokenization (`khmer_nltk.word_tokenize`)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Appending `</s> <2km>` to the input and formatting the target as `<2km> ... </s>`
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Padding to a maximum length of 128 tokens

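The special-token step above can be sketched as a small formatting helper. This is an illustration only: the function name `build_pair` and the placeholder tokens are hypothetical, and the inputs are assumed to be already segmented into words (for example by `khmer_nltk.word_tokenize`).

```python
LANG_TAG = "<2km>"
EOS = "</s>"


def build_pair(noisy_words, clean_words):
    """Format pre-segmented words as an input/target pair with the special tokens.

    Both arguments are lists of words, e.g. the output of a Khmer word tokenizer.
    """
    source = " ".join(noisy_words) + f" {EOS} {LANG_TAG}"       # "<text> </s> <2km>"
    target = f"{LANG_TAG} " + " ".join(clean_words) + f" {EOS}"  # "<2km> <text> </s>"
    return {"input": source, "target": target}


# Placeholder tokens stand in for segmented Khmer words
pair = build_pair(["w1", "w2_wrong", "w3"], ["w1", "w2", "w3"])
print(pair["input"])   # w1 w2_wrong w3 </s> <2km>
print(pair["target"])  # <2km> w1 w2 w3 </s>
```

When such pairs are tokenized with a Hugging Face tokenizer, the padding step would typically correspond to `padding="max_length", truncation=True, max_length=128`; whether the training script used exactly these arguments is an assumption.
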
## Training Results

### Performance Metrics
- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525

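For readers unfamiliar with WER, the sketch below computes it in the standard way: word-level edit distance divided by the number of reference words. It assumes the reported WER follows this definition and is expressed as a fraction; the evaluation script for this model is not shown here, and the function name `word_error_rate` and the placeholder words are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# One substituted word out of four reference words
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3 w4"))  # 0.25
```

Because the inputs are split on whitespace, Khmer text must be word-segmented before it is scored this way.
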
### Training Analysis
The training logs show the following characteristics:

- **Rapid Convergence:** Training loss decreased from 0.6786 in epoch 1 to 0.0091 in epoch 40
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- **Outstanding Accuracy:** BLEU-1 reached 99.54 and BLEU-4 reached 98.49, demonstrating near-perfect homophone correction on the test set
- **Minimal Error Rate:** A WER of 0.008 indicates an extremely low word error rate, making the model reliable for practical applications
- **No Overfitting:** The small and consistent gap between training loss (0.0091) and validation loss (0.0235) suggests good generalization without overfitting
- **Early Performance:** The model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction

### Training Configuration
- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management

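To make the warmup ratio concrete, the sketch below collects the listed hyperparameters in a plain dictionary and derives how many optimizer steps fall in the warmup phase. The key names mirror Hugging Face `TrainingArguments` fields, but this is not the author's training script: the mapping of "best 2 checkpoints" to `save_total_limit`, the helper `warmup_steps`, and the dataset size are assumptions for illustration.

```python
import math

# Hyperparameters as listed above; key names mirror Hugging Face TrainingArguments fields
training_config = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 32,
    "num_train_epochs": 40,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "fp16": True,
    "save_total_limit": 2,  # assumed mapping of "best 2 checkpoints"
}


def warmup_steps(num_examples, config, num_devices=1):
    """Number of optimizer steps spent in learning-rate warmup."""
    batch = config["per_device_train_batch_size"] * num_devices
    steps_per_epoch = math.ceil(num_examples / batch)
    total_steps = steps_per_epoch * config["num_train_epochs"]
    return math.ceil(total_steps * config["warmup_ratio"])


# 10,000 training examples is a made-up figure for illustration only
print(warmup_steps(10_000, training_config))  # 1252
```

With that hypothetical dataset size, one epoch is 313 steps, the full run is 12,520 steps, and the first 1,252 of them ramp the learning rate up to 3e-5.
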
## Usage

### Basic Usage

```python
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones
text = "αααα»ααααααααΌαααααα·ααααΆααα"  # Input with homophone error

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input in the "<text> </s> <2km>" format used during training
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction with beam search (no sampling)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0,
        temperature=1.0
    )

# Decode output and strip leftover special tokens and SentencePiece markers
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αααα»ααααα»ααα ααααα·ααααΆααα
```

### Using with Streamlit

```python
import streamlit as st
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

MODEL_NAME = "socheatasokhachan/khmerhomophonecorrector"


@st.cache_resource
def load_model():
    """Load the model and tokenizer once and reuse them across reruns."""
    model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model.eval()
    return model, tokenizer


# Load model
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input.strip():
    # Same steps as in Basic Usage: segment, add special tokens, generate, decode
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5,
                                 forced_bos_token_id=32000, forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    st.write(corrected.replace("▁", " ").strip())
```

## Model Architecture

- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2024},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```

## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)

## Acknowledgments

- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools

---

**Note:** This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks.