---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---

# Khmer Homophone Corrector

A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.

## Model Description

- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)

## Intended Uses & Limitations

### Intended Use Cases

- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content

### Limitations

- **Language Specific:** Only works with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset

## Training and Evaluation Data

### Training Data

- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word
segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs

### Evaluation Data

- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation
- **Validation:** Cross-validation on homophone groups

### Data Preprocessing

1. **Word Segmentation:** Using Khmer word tokenization (`khmer_nltk.word_tokenize`)
2. **Text Normalization:** Standardizing text format with special tokens
3. **Special Tokens:** Adding `</s> <2km>` for input and `<2km> ... </s>` for target
4. **Sequence Format:** Converting to sequence-to-sequence format
5. **Padding:** Max length 128 tokens with padding

## Training Results

### Performance Metrics

- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525

### Training Analysis

The model shows strong performance and stable training behavior:

- **Rapid Convergence:** Training loss fell from 0.6786 in epoch 1 to 0.0091 in epoch 40
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
- **High Accuracy:** BLEU-1 reached 99.54 and BLEU-4 reached 98.49, indicating near-perfect homophone correction on the test set
- **Low Error Rate:** A WER of 0.008 means very few word-level errors remain in the corrected output
- **No Overfitting:** The small and consistent gap between training loss (0.0091) and validation loss (0.0235) suggests the model generalizes without overfitting
- **Early Performance:** The model achieved its best BLEU scores and WER as early as epoch 1, which points to the effectiveness of the PrahokBART base model for Khmer homophone correction

### Training Configuration

- **Base Model:** PrahokBART (from [nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported with checkpoint management

## Usage

### Basic Usage

```python
import torch
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones
text = "ខ្ញុំកំពង់នូវសកលវិទ្យាល័យ"  # Input with homophone error

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input: segmented text, end-of-sentence token, then the language tag
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction with deterministic beam search
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0,
        temperature=1.0
    )

# Decode output and strip leftover special tokens and segmentation markers
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▂", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: ខ្ញុំកំពុងនៅសកលវិទ្យាល័យ
```

### Using with Streamlit

```python
import streamlit as st
from khmer_nltk import word_tokenize
from transformers import MBartForConditionalGeneration, AutoTokenizer

MODEL_NAME = "socheatasokhachan/khmerhomophonecorrector"

@st.cache_resource
def load_model():
    # Cached so the model and tokenizer are loaded only once
    model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model.eval()
    return model, tokenizer

# Load model
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")

if st.button("Correct") and user_input.strip():
    # Same steps as Basic Usage: segment, tag, generate, clean up
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024, add_special_tokens=True)
    outputs = model.generate(**inputs, max_length=1024, num_beams=5, early_stopping=True,
                             forced_bos_token_id=32000, forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▂", " ").strip()
    st.write(corrected)
```

## Model Architecture

- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-Sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2024},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```

## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**

- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)

## Acknowledgments

- The PrahokBART research team for the base model
- Hugging Face for the transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to the Khmer language processing tools

---

**Note:** This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks.
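
For readers interpreting the WER figure under Performance Metrics: word error rate is conventionally the word-level edit distance between the model output and the reference, divided by the number of reference words. The sketch below illustrates that conventional definition only; the evaluation script used for this model is not shown in this card, so the function name `word_error_rate` and the hand-segmented example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Both inputs are assumed to be already word-segmented and space-separated
    (in practice, e.g. via khmer_nltk.word_tokenize).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hand-segmented version of the usage example (segmentation is an assumption):
reference = "ខ្ញុំ កំពុង នៅ សកលវិទ្យាល័យ"
uncorrected = "ខ្ញុំ កំពង់ នូវ សកលវិទ្យាល័យ"
print(word_error_rate(reference, uncorrected))  # 0.5 (2 wrong words out of 4)
```

Under this definition, a WER of 0.008 corresponds to roughly eight word-level errors per thousand reference words.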