# Khmer Homophone Corrector

A fine-tuned PrahokBART model for correcting homophones in Khmer text. It builds on PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses challenges specific to Khmer, including ambiguous word boundaries and homophone confusion.

## Model Description

- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **License:** MIT
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)

## Intended Uses & Limitations

### Intended Use Cases
- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn correct Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content

### Limitations
- **Language Specific:** Works only with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling correction
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset

## Training and Evaluation Data

### Training Data
- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs

### Evaluation Data
- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation (a scoring sketch follows this list)
- **Validation:** Cross-validation on homophone groups
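
This card does not ship an evaluation script, so the following is only a minimal sketch of how the listed metrics could be reproduced on word-segmented output, assuming the commonly used `nltk` and `jiwer` packages; the reference/hypothesis pair below is an illustrative placeholder, not the actual test set.

```python
# A hedged scoring sketch (assumes: pip install nltk jiwer).
# The example pair is a placeholder, not the real evaluation data.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from jiwer import wer

references = ["αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"]  # gold corrections (word-segmented)
hypotheses = ["αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"]  # model outputs (word-segmented)

smooth = SmoothingFunction().method1
refs_tok = [[r.split()] for r in references]  # corpus_bleu expects a list of reference lists
hyps_tok = [h.split() for h in hypotheses]
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights -> BLEU-n
    score = corpus_bleu(refs_tok, hyps_tok, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")

print(f"WER: {wer(references, hypotheses):.3f}")
```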

### Data Preprocessing
1. **Word Segmentation:** Segmenting words with Khmer tokenization (`khmer_nltk.word_tokenize`)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Appending `</s> <2km>` to the input and wrapping the target as `<2km> ... </s>`
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Padding and truncating to a maximum length of 128 tokens (see the sketch after this list)
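
To make steps 1–5 concrete, here is a hedged sketch of preparing a single input–target pair; the sentence pair comes from the usage example below, while the `build_example` helper is hypothetical. The special-token layout and the 128-token limit follow the list above.

```python
# A minimal preprocessing sketch; build_example is a hypothetical helper.
# The special-token layout and 128-token limit follow this model card.
from khmer_nltk import word_tokenize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")

def build_example(raw_input: str, raw_target: str, max_length: int = 128):
    # 1) Word segmentation with khmer_nltk
    src = " ".join(word_tokenize(raw_input))
    tgt = " ".join(word_tokenize(raw_target))
    # 2-3) Normalize into the expected special-token format
    src = f"{src} </s> <2km>"  # input:  text </s> <2km>
    tgt = f"<2km> {tgt} </s>"  # target: <2km> text </s>
    # 4-5) Tokenize into padded/truncated seq2seq tensors
    model_inputs = tokenizer(src, max_length=max_length, padding="max_length",
                             truncation=True, return_tensors="pt")
    labels = tokenizer(tgt, max_length=max_length, padding="max_length",
                       truncation=True, return_tensors="pt").input_ids
    model_inputs["labels"] = labels
    return model_inputs

example = build_example("αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™", "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™")
```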

## Training Results

### Performance Metrics
- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525

### Training Analysis
The model shows strong performance and stable training behavior:

- **Rapid Convergence:** Training loss fell from 0.6786 in epoch 1 to 0.0091 in epoch 40
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization
- **High Accuracy:** BLEU-1 reached 99.54 and BLEU-4 98.49, indicating near-perfect homophone correction on the test set
- **Low Error Rate:** A WER of 0.008 corresponds to fewer than one word error per hundred words, making the model reliable in practice
- **No Overfitting:** The small, consistent gap between training loss (0.0091) and validation loss (0.0235) suggests good generalization
- **Early Performance:** The best BLEU scores and WER were reached as early as epoch 1, suggesting the PrahokBART base model is already well suited to Khmer homophone correction

### Training Configuration
- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management (a configuration sketch follows this list)
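
The training script itself is not part of this card, but the configuration above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows; `output_dir`, `load_best_model_at_end`, and the best-checkpoint selection metric are assumptions.

```python
# A sketch of the Training Configuration list as Seq2SeqTrainingArguments.
# output_dir and load_best_model_at_end are assumptions; the numeric
# hyperparameters come directly from the list above.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmerhomophonecorrector",  # assumed
    learning_rate=3e-5,                    # AdamW is the Trainer's default optimizer
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",                 # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    save_total_limit=2,                    # keep the best 2 checkpoints
    load_best_model_at_end=True,           # assumed
)
```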

## Usage

### Basic Usage

```python
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with a homophone error
text = "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž„αŸ‹αž“αžΌαžœαžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™"

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input in the model's expected format: "text </s> <2km>"
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,  # matches the model's maximum sequence length
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction with beam search
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0
    )

# Decode output and strip special tokens and the word separator
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("β–‚", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αž“αŸ…αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™
```

### Using with Streamlit

```python
import streamlit as st
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model (cached across Streamlit reruns)
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input:
    # Segment, add special tokens, generate, and clean up, as in Basic Usage
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128, num_beams=5,
                                 early_stopping=True,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("β–‚", " ").strip()
    st.write(corrected)
```

## Model Architecture

- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing (see the inspection snippet below)
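
One quick way to see the Khmer-specific SentencePiece behavior is to inspect how the tokenizer splits a segmented sentence; this is only an inspection snippet, and the `β–‚` word separator it surfaces is why the decoding step in Basic Usage replaces that character with a space.

```python
# Inspect SentencePiece tokenization; the sample sentence comes from Basic Usage.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
pieces = tokenizer.tokenize("αžαŸ’αž‰αž»αŸ† αž€αŸ†αž–αž»αž„ αž“αŸ… αžŸαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸαž™ </s> <2km>")
print(pieces)  # subword pieces; word boundaries appear as the "β–‚" separator
```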

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2025},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```

## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)

## Acknowledgments

- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools

---

**Note:** This model is designed specifically for Khmer homophone correction and may not work well for other languages or tasks.