# Khmer Homophone Corrector

A fine-tuned PrahokBART model for correcting homophones in Khmer text. It builds on PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses challenges specific to Khmer, including ambiguous word boundaries and homophone confusion.

## Model Description

- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **License:** MIT
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)

## Intended Uses & Limitations

### Intended Use Cases
- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn correct Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content

### Limitations
- **Language Specific:** Works only with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling correction
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset

## Training and Evaluation Data

### Training Data
- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs

### Evaluation Data
- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation (a scoring sketch follows this list)
- **Validation:** Cross-validation on homophone groups
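
This card does not ship an evaluation script, so the following is only a minimal sketch of how the listed metrics could be reproduced on word-segmented output, assuming the commonly used `nltk` and `jiwer` packages; the reference/hypothesis pair below is an illustrative placeholder, not the actual test set.

```python
# A hedged scoring sketch (assumes: pip install nltk jiwer).
# The example pair is a placeholder, not the real evaluation data.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from jiwer import wer

references = ["αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"]  # gold corrections (word-segmented)
hypotheses = ["αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"]  # model outputs (word-segmented)

smooth = SmoothingFunction().method1
refs_tok = [[r.split()] for r in references]  # corpus_bleu expects a list of reference lists
hyps_tok = [h.split() for h in hypotheses]
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights -> BLEU-n
    score = corpus_bleu(refs_tok, hyps_tok, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")

print(f"WER: {wer(references, hypotheses):.3f}")
```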

### Data Preprocessing
1. **Word Segmentation:** Segmenting words with Khmer tokenization (`khmer_nltk.word_tokenize`)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Appending `</s> <2km>` to the input and wrapping the target as `<2km> ... </s>`
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Padding and truncating to a maximum length of 128 tokens (see the sketch after this list)
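
To make steps 1–5 concrete, here is a hedged sketch of preparing a single input–target pair; the sentence pair comes from the usage example below, while the `build_example` helper is hypothetical. The special-token layout and the 128-token limit follow the list above.

```python
# A minimal preprocessing sketch; build_example is a hypothetical helper.
# The special-token layout and 128-token limit follow this model card.
from khmer_nltk import word_tokenize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")

def build_example(raw_input: str, raw_target: str, max_length: int = 128):
    # 1) Word segmentation with khmer_nltk
    src = " ".join(word_tokenize(raw_input))
    tgt = " ".join(word_tokenize(raw_target))
    # 2-3) Normalize into the expected special-token format
    src = f"{src} </s> <2km>"  # input:  text </s> <2km>
    tgt = f"<2km> {tgt} </s>"  # target: <2km> text </s>
    # 4-5) Tokenize into padded/truncated seq2seq tensors
    model_inputs = tokenizer(src, max_length=max_length, padding="max_length",
                             truncation=True, return_tensors="pt")
    labels = tokenizer(tgt, max_length=max_length, padding="max_length",
                       truncation=True, return_tensors="pt").input_ids
    model_inputs["labels"] = labels
    return model_inputs

example = build_example("αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™", "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™")
```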

## Training Results

### Performance Metrics
- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525

### Training Analysis
The model shows strong performance and stable training behavior:

- **Rapid Convergence:** Training loss fell from 0.6786 in epoch 1 to 0.0091 in epoch 40
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization
- **High Accuracy:** BLEU-1 reached 99.54 and BLEU-4 98.49, indicating near-perfect homophone correction on the test set
- **Low Error Rate:** A WER of 0.008 corresponds to fewer than one word error per hundred words, making the model reliable in practice
- **No Overfitting:** The small, consistent gap between training loss (0.0091) and validation loss (0.0235) suggests good generalization
- **Early Performance:** The best BLEU scores and WER were reached as early as epoch 1, suggesting the PrahokBART base model is already well suited to Khmer homophone correction

### Training Configuration
- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management (a configuration sketch follows this list)
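
The training script itself is not part of this card, but the configuration above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows; `output_dir`, `load_best_model_at_end`, and the best-checkpoint selection metric are assumptions.

```python
# A sketch of the Training Configuration list as Seq2SeqTrainingArguments.
# output_dir and load_best_model_at_end are assumptions; the numeric
# hyperparameters come directly from the list above.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmerhomophonecorrector",  # assumed
    learning_rate=3e-5,                    # AdamW is the Trainer's default optimizer
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",                 # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    save_total_limit=2,                    # keep the best 2 checkpoints
    load_best_model_at_end=True,           # assumed
)
```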

## Usage

### Basic Usage

```python
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with a homophone error
text = "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input in the model's expected format: "text </s> <2km>"
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,  # matches the model's maximum sequence length
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction with beam search
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0
    )

# Decode output and strip special tokens and the word separator
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™
```

### Using with Streamlit

```python
import streamlit as st
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model (cached across Streamlit reruns)
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input:
    # Segment, add special tokens, generate, and clean up, as in Basic Usage
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128, num_beams=5,
                                 early_stopping=True,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("β–‚", " ").strip()
    st.write(corrected)
```

## Model Architecture

- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing (see the inspection snippet below)
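
One quick way to see the Khmer-specific SentencePiece behavior is to inspect how the tokenizer splits a segmented sentence; this is only an inspection snippet, and the `β–‚` word separator it surfaces is why the decoding step in Basic Usage replaces that character with a space.

```python
# Inspect SentencePiece tokenization; the sample sentence comes from Basic Usage.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
pieces = tokenizer.tokenize("αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™ </s> <2km>")
print(pieces)  # subword pieces; word boundaries appear as the "β–‚" separator
```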

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2025},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```

## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)

## Acknowledgments

- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools

---

**Note:** This model is designed specifically for Khmer homophone correction and may not work well for other languages or tasks.