---
language:
- km
license: mit
tags:
- khmer
- homophone-correction
- text-generation
- seq2seq
- prahokbart
datasets:
- custom-khmer-homophone
metrics:
- bleu
- wer
pipeline_tag: text2text-generation
base_model:
- nict-astrec-att/prahokbart_big
---

# Khmer Homophone Corrector

A PrahokBART model fine-tuned to correct homophones in Khmer text. It builds on PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses challenges specific to Khmer processing, including word-boundary ambiguity and homophone confusion.

## Model Description

- **Developed by:** [Socheata Sokhachan](https://github.com/SocheataSokhaChan22)
- **Model type:** PrahokBART (fine-tuned for homophone correction)
- **Base Model:** [PrahokBART](https://huggingface.co/nict-astrec-att/prahokbart_big)
- **Language:** Khmer (km)
- **Repository:** [GitHub](https://github.com/SocheataSokhaChan22/khmerhomophonecorrector)
- **Live Demo:** [Streamlit App](https://khmerhomophonecorrector.streamlit.app)

## Intended Uses & Limitations

### Intended Use Cases

- **Homophone Correction:** Correcting commonly confused Khmer homophones in text
- **Educational Applications:** Helping students learn proper Khmer spelling
- **Text Preprocessing:** Improving text quality for downstream Khmer NLP tasks
- **Content Creation:** Assisting writers in producing error-free Khmer content

### Limitations

- **Language Specific:** Works only with Khmer text
- **Homophone Focus:** Designed specifically for homophone correction, not general grammar or spelling correction
- **Context Dependency:** May require surrounding context for optimal corrections
- **Training Data Scope:** Limited to the homophone pairs in the training dataset

## Training and Evaluation Data

### Training Data

- **Dataset:** Custom Khmer homophone dataset
- **Size:** 268+ homophone groups
- **Coverage:** Common Khmer homophones across different word categories
- **Preprocessing:** Word segmentation using Khmer NLP tools
- **Format:** JSON with input-target pairs

### Evaluation Data

- **Test Set:** Homophone pairs not seen during training
- **Metrics:** BLEU score, WER, and human evaluation
- **Validation:** Cross-validation on homophone groups

### Data Preprocessing

1. **Word Segmentation:** Tokenize words with `khmer_nltk.word_tokenize`
2. **Text Normalization:** Standardize the text format with special tokens
3. **Special Tokens:** Append `</s> <2km>` to the input and wrap the target as `<2km> ... </s>`
4. **Sequence Format:** Convert to sequence-to-sequence input-target pairs
5. **Padding:** Pad and truncate to a maximum length of 128 tokens (combined into a single function in the sketch below)
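
A minimal sketch of those five steps, assuming `khmer_nltk` for segmentation and a loaded PrahokBART tokenizer; `preprocess_pair` is an illustrative helper, not code from the repository:

```python
from khmer_nltk import word_tokenize

def preprocess_pair(source: str, target: str, tokenizer, max_length: int = 128):
    # Step 1: word segmentation
    src = " ".join(word_tokenize(source))
    tgt = " ".join(word_tokenize(target))
    # Steps 2-3: attach the PrahokBART special tokens
    src = f"{src} </s> <2km>"
    tgt = f"<2km> {tgt} </s>"
    # Steps 4-5: convert to padded/truncated seq2seq features
    features = tokenizer(src, max_length=max_length,
                         padding="max_length", truncation=True)
    labels = tokenizer(tgt, max_length=max_length,
                       padding="max_length", truncation=True)
    features["labels"] = labels["input_ids"]
    return features
```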

## Training Results

### Performance Metrics

- **BLEU-1 Score:** 99.5398
- **BLEU-2 Score:** 99.162
- **BLEU-3 Score:** 98.8093
- **BLEU-4 Score:** 98.4861
- **WER (Word Error Rate):** 0.008
- **Human Evaluation Score:** 0.008
- **Final Training Loss:** 0.0091
- **Final Validation Loss:** 0.023525

These scores can be recomputed with standard tooling, as sketched below.
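
The card does not name the exact evaluation libraries, so this sketch assumes NLTK's `corpus_bleu` for BLEU and the `jiwer` package for WER; `score` is an illustrative helper:

```python
from nltk.translate.bleu_score import corpus_bleu
from jiwer import wer

def score(references, hypotheses):
    # NLTK expects tokenized text: one list of reference token lists per sample
    refs = [[r.split()] for r in references]
    hyps = [h.split() for h in hypotheses]
    results = {}
    for n in range(1, 5):
        # Cumulative BLEU-n uses uniform weights over 1..n-grams
        weights = tuple(1.0 / n for _ in range(n))
        results[f"BLEU-{n}"] = corpus_bleu(refs, hyps, weights=weights) * 100
    results["WER"] = wer(references, hypotheses)  # jiwer accepts raw strings
    return results
```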

### Training Analysis

The model shows strong performance and stable training behavior:

- **Rapid Convergence:** Training loss dropped from 0.6786 in epoch 1 to 0.0091 in epoch 40
- **Stable Validation:** Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization
- **High Accuracy:** BLEU-1 reached 99.54% and BLEU-4 98.49%, demonstrating near-perfect homophone correction on the test set
- **Minimal Error Rate:** A WER of 0.008 makes the model reliable for practical applications
- **No Overfitting:** The small, consistent gap between training loss (0.0091) and validation loss (0.0235) suggests good generalization
- **Early Performance:** The model reached its best BLEU scores and WER as early as epoch 1, reflecting the strength of the PrahokBART base model for this task

### Training Configuration

- **Base Model:** PrahokBART ([nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big))
- **Model Architecture:** PrahokBART (Khmer-specific BART variant)
- **Training Framework:** Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Batch Size:** 32 (per device)
- **Training Epochs:** 40
- **Warmup Ratio:** 0.1
- **Weight Decay:** 0.01
- **Mixed Precision:** FP16 enabled
- **Evaluation Strategy:** Every epoch
- **Save Strategy:** Every epoch (best 2 checkpoints kept)
- **Max Sequence Length:** 128 tokens
- **Resume Training:** Supported via checkpoint management

These settings map directly onto Hugging Face training arguments, as sketched below.
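
A minimal sketch assuming a recent Transformers release (older versions name `eval_strategy` as `evaluation_strategy`); the output directory is illustrative, and AdamW is the Trainer default, so it needs no explicit argument:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./khmer-homophone-corrector",  # illustrative path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,                 # mixed precision
    eval_strategy="epoch",     # evaluate every epoch
    save_strategy="epoch",     # checkpoint every epoch
    save_total_limit=2,        # keep the best 2 checkpoints
    load_best_model_at_end=True,
)
```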

## Usage

### Basic Usage

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize
import torch

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophones (replace with your own Khmer input)
text = "..."  # a Khmer sentence containing a homophone error

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input with the PrahokBART special tokens
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1024,
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,
        forced_eos_token_id=32001,
        length_penalty=1.0
    )

# Decode output and strip special tokens / SentencePiece markers
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▁", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
```

### Using with Streamlit

```python
import streamlit as st
import torch
from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct"):
    # Segment, add special tokens, generate, and display the correction
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=1024, num_beams=5,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    st.write(corrected.replace("▁", " ").strip())
```

## Model Architecture

- **Base Model:** PrahokBART (Khmer-specific BART variant)
- **Architecture:** Sequence-to-sequence Transformer
- **Max Sequence Length:** 128 tokens
- **Special Features:** Khmer word segmentation and normalization
- **Tokenization:** SentencePiece with Khmer-specific preprocessing (see the quick check below)
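
A quick way to inspect the SentencePiece output, assuming the released tokenizer; the Khmer word កម្ពុជា ("Cambodia") is just an illustrative input:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
ids = tokenizer("αž€αž˜αŸ’αž–αž»αž‡αžΆ </s> <2km>")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # '▁'-prefixed SentencePiece pieces
```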

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sokhachan2025khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2025},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}
```

## Related Research

This model builds upon and fine-tunes the PrahokBART model:

**PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation**
- Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
- Published: COLING 2025
- Paper: [https://aclanthology.org/2025.coling-main.87.pdf](https://aclanthology.org/2025.coling-main.87.pdf)
- Base Model: [https://huggingface.co/nict-astrec-att/prahokbart_big](https://huggingface.co/nict-astrec-att/prahokbart_big)

## Acknowledgments

- The PrahokBART research team for the base model
- Hugging Face for the Transformers library
- The Khmer NLP community for language resources
- Streamlit for the web framework
- Contributors to Khmer language processing tools

---

**Note:** This model is designed specifically for Khmer homophone correction and may not work well for other languages or tasks.