---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
- production-ready
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v10
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 38.83
    - type: bertscore
      value: 0.8608
pipeline_tag: translation
---

# NLLB-350M-EN-KM-v10

## Model Description

This model is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. This is the **research evaluation version** with full 10-epoch training, achieving competitive translation quality with 42% fewer parameters than the baseline.

- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from the 600M baseline)

## Model Details

### Architecture

- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M

### Training Procedure

- **Distillation method:** Temperature-scaled knowledge distillation
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5
- **Training epochs:** 10 (full training)
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
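The training script itself is not published here, but the settings above describe the standard temperature-scaled distillation objective. The sketch below shows one way the listed temperature (5.0) and lambda (0.5) could combine the soft-target term with the usual cross-entropy term; the function, its argument names, and the exact orientation of the weighting are illustrative assumptions, not the original training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5, pad_token_id=1):
    """Temperature-scaled KD loss (sketch): lam * soft-target KL + (1 - lam) * hard-label CE.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len) target token ids; padding positions hold pad_token_id
    """
    # Soft-target term: KL divergence between teacher and student distributions,
    # both softened by the temperature and rescaled by T^2 to keep gradient magnitudes comparable.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)

    # Average the per-token KL over non-padding positions only.
    mask = labels.ne(pad_token_id).float()
    kd = (kd * mask).sum() / mask.sum() * temperature**2

    # Hard-label term: ordinary cross-entropy against the reference translation.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_token_id,
    )

    return lam * kd + (1.0 - lam) * ce
```

With lambda set to 0.5 the two terms contribute equally, so the direction of the weighting does not matter at this particular setting.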
## Intended Uses

### Direct Use

This model is intended for:

- Production English-to-Khmer translation applications
- Research on efficient neural machine translation
- Cambodian language technology development
- Cultural preservation through digital translation tools

### Downstream Use

- Integration into mobile translation apps
- Website localization services
- Educational language learning platforms
- Government and NGO translation services in Cambodia

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration: force the decoder to start with the Khmer language token
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```

## Training Details

### Training Data

- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic data generated using the DeepSeek translation API
- **Preprocessing:** Tokenized using the NLLB-200 tokenizer with a maximum length of 128

### Training Hyperparameters

- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 10
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8

### Training Progress

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.658600      | 0.674992        |
| 2     | 0.534500      | 0.596366        |
| 3     | 0.484700      | 0.566999        |
| 4     | 0.453800      | 0.549162        |
| 5     | 0.436300      | 0.542330        |
| 6     | 0.432900      | 0.536817        |
| 7     | 0.421000      | 0.534668        |
| 8     | 0.412800      | 0.532001        |
| 9     | 0.417400      | 0.533419        |
| 10    | 0.413200      | 0.531947        |

## Evaluation

### Testing Data

The model was evaluated on the Asian Language Treebank (ALT) corpus, which contains manually translated English-Khmer pairs from English Wikinews articles.

### Metrics

| Metric       | Our Model (350M) | Baseline (600M) | Difference   |
|--------------|------------------|-----------------|--------------|
| chrF score   | 38.83            | 43.88           | -5.05 points |
| BERTScore F1 | 0.8608           | 0.8573          | +0.0035      |
| Parameters   | 350M             | 600M            | -42%         |

### Results

- Achieves 88.5% of the baseline chrF score with 42% fewer parameters
- Slightly improves on BERTScore F1 (+0.0035), suggesting semantic similarity is preserved
- Substantially lower compute and memory footprint for deployment scenarios
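These numbers can be reproduced with standard metric packages. The snippet below is a minimal sketch of scoring a set of system outputs against ALT references with sacrebleu (chrF) and bert-score; the variable names and the choice of BERTScore backbone are illustrative assumptions, since the exact evaluation configuration is not published here.

```python
# pip install sacrebleu bert-score
from sacrebleu.metrics import CHRF
from bert_score import score as bert_score

# Illustrative placeholders: model outputs and ALT reference translations, one string per test sentence.
hypotheses = ["..."]
references = ["..."]

# chrF (character n-gram F-score), as reported in the Metrics table.
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, [references]).score)

# BERTScore F1; lang="km" falls back to a multilingual encoder. The backbone used
# for the reported 0.8608 is not specified, so treat this choice as an assumption.
P, R, F1 = bert_score(hypotheses, references, lang="km")
print("BERTScore F1:", F1.mean().item())
```

Scores will only match the table above if the same test split, decoding settings, and BERTScore backbone are used.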
## Performance Comparison

| Model                      | Parameters | chrF Score | BERTScore F1 | Efficiency Gain |
|----------------------------|------------|------------|--------------|-----------------|
| **NLLB-350M-EN-KM (Ours)** | 350M       | 38.83      | 0.8608       | 42% smaller     |
| NLLB-200-Distilled-600M    | 600M       | 43.88      | 0.8573       | Baseline        |

## Limitations and Bias

### Limitations

- **Performance trade-off:** 5.05-point chrF decrease compared to the larger baseline
- **Synthetic training data:** May not capture all real-world linguistic variation
- **Domain dependency:** Performance may vary across different text types
- **Low-resource constraints:** Limited by the available English-Khmer parallel data

### Bias Considerations

- Training data generated via a translation API may inherit the source model's biases
- Limited representation of Khmer dialects and regional variations
- Potential gender, cultural, and socioeconomic biases in translation outputs
- Urban vs. rural language usage patterns may not be equally represented

### Ethical Considerations

- The model is designed to support Cambodian language preservation and digital inclusion
- Users should validate translations for sensitive or critical applications
- Consider cultural context when deploying in official or educational settings

## Environmental Impact

- **Hardware:** Training performed on a single NVIDIA A100-SXM4-80GB
- **Training time:** Approximately 10 hours for full training
- **Energy efficiency:** Significantly more efficient than training a comparable model from scratch
- **Deployment efficiency:** 42% fewer parameters, reducing memory and compute requirements at inference

## Citation

    @misc{nllb350m_en_km_v10_2025,
      title={NLLB-350M-EN-KM-v10: Efficient English-Khmer Neural Machine Translation via Knowledge Distillation},
      author={Chealyfey Vutha},
      year={2025},
      url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v10}
    }

## Acknowledgments

This work builds upon Meta's NLLB-200 models and uses the Asian Language Treebank (ALT) corpus for evaluation.

## Model Card Contact

For questions or feedback about this model card: lyfeytech@gmail.com