---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
- production-ready
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v10
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 38.83
    - type: bertscore
      value: 0.8608
pipeline_tag: translation
---
# NLLB-350M-EN-KM-v10
## Model Description
This is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. It is the **research evaluation version**, trained for the full 10 epochs, and it reaches competitive translation quality with 42% fewer parameters than the 600M baseline.
- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from 600M baseline)
## Model Details
### Architecture
- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M (see the verification sketch below)
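The reduced depth and the parameter count can be checked directly against the published checkpoint. A minimal verification sketch; NLLB checkpoints expose these values through the standard M2M100 config fields:
```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("lyfeyvutha/nllb_350M_en_km_v10")

# Depth and width, as reported in the table above
print(model.config.encoder_layers)           # expected: 3
print(model.config.decoder_layers)           # expected: 3
print(model.config.d_model)                  # expected: 1024
print(model.config.encoder_attention_heads)  # expected: 16

# Total parameter count (~350M, dominated by the ~256k-token embedding table)
print(sum(p.numel() for p in model.parameters()))
```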
### Training Procedure
- **Distillation method:** Temperature-scaled knowledge distillation
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5, balancing the distillation and cross-entropy terms (see the loss sketch after this list)
- **Training epochs:** 10 (full training)
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
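The exact training code is not published here, but temperature-scaled knowledge distillation with these settings conventionally combines a hard-label cross-entropy term with a temperature-softened KL term against the teacher's logits. A minimal sketch under that assumption; function and tensor names are illustrative:
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5, pad_token_id=1):
    """Weighted sum of hard-label CE and temperature-scaled KL to the teacher."""
    # Hard-label cross-entropy against the reference translation
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=pad_token_id,
    )
    # Soft-label KL divergence between temperature-softened distributions,
    # rescaled by T^2 as in Hinton et al. (2015)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return lam * kl + (1.0 - lam) * ce
```
With lambda at 0.5 the two terms are weighted equally, so the direction of the weighting convention does not change the objective.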
## Intended Uses
### Direct Use
This model is intended for:
- Production English-to-Khmer translation applications
- Research on efficient neural machine translation
- Cambodian language technology development
- Cultural preservation through digital translation tools
### Downstream Use
- Integration into mobile translation apps
- Website localization services
- Educational language learning platforms
- Government and NGO translation services in Cambodia
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration, forcing Khmer as the target language
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)  # decode the first (only) sequence
print(translation)
```
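For higher throughput, the same setup extends to batched translation; a short sketch (the batching and device handling are assumptions, not part of the published example):
```python
import torch

# Reuse `model`, `tokenizer`, and `generation_config` from the snippet above
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

texts = ["Good morning.", "Where is the market?"]
batch = tokenizer(texts, return_tensors="pt", padding=True).to(device)
outputs = model.generate(**batch, generation_config=generation_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```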
## Training Details
### Training Data
- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic parallel data generated with the DeepSeek API
- **Preprocessing:** Tokenized with the NLLB-200 tokenizer at a maximum length of 128 (see the sketch below)
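A hedged sketch of that preprocessing step, assuming the standard `transformers` `text_target` convention; the column names `en` and `km` are illustrative:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(example):
    # Tokenize source and target together; both sides truncated to 128 tokens
    return tokenizer(
        example["en"],
        text_target=example["km"],
        max_length=128,
        truncation=True,
    )
```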
### Training Hyperparameters
- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 10
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8
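These settings map naturally onto the Hugging Face `Seq2SeqTrainingArguments`; a minimal sketch (the output path, precision, and logging choices are assumptions):
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb_350M_en_km_v10",  # illustrative path
    per_device_train_batch_size=48,
    learning_rate=3e-5,
    optim="adamw_torch",               # AdamW
    lr_scheduler_type="cosine",
    num_train_epochs=10,
    fp16=True,                         # assumption: mixed precision on the A100
    evaluation_strategy="epoch",       # matches the per-epoch losses below
    logging_strategy="epoch",
)
```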
### Training Progress
| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.658600 | 0.674992 |
| 2 | 0.534500 | 0.596366 |
| 3 | 0.484700 | 0.566999 |
| 4 | 0.453800 | 0.549162 |
| 5 | 0.436300 | 0.542330 |
| 6 | 0.432900 | 0.536817 |
| 7 | 0.421000 | 0.534668 |
| 8 | 0.412800 | 0.532001 |
| 9 | 0.417400 | 0.533419 |
| 10 | 0.413200 | 0.531947 |
## Evaluation
### Testing Data
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs from English Wikinews articles.
### Metrics
| Metric | Our Model (350M) | Baseline (600M) | Difference |
|--------|------------------|-----------------|------------|
| chrF Score | 38.83 | 43.88 | -5.05 points |
| BERTScore F1 | 0.8608 | 0.8573 | +0.0035 |
| Parameters | 350M | 600M | -42% |
### Results
- Achieves 88.5% of baseline chrF performance with 42% fewer parameters
- Slightly improves on BERTScore F1 (+0.0035), suggesting semantic adequacy on par with the baseline
- Significant computational efficiency gains for deployment scenarios
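The scores above can be recomputed with the Hugging Face `evaluate` library; a sketch with illustrative placeholder data standing in for the actual ALT predictions and references:
```python
import evaluate

# Illustrative placeholders; in practice these are the model's translations
# of the ALT test set and the corresponding reference translations.
preds = ["សួស្តី តើអ្នកសុខសប្បាយទេ?"]
refs = ["សួស្តី តើអ្នកសុខសប្បាយទេ?"]

chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print(chrf.compute(predictions=preds, references=[[r] for r in refs])["score"])
f1 = bertscore.compute(predictions=preds, references=refs, lang="km")["f1"]
print(sum(f1) / len(f1))  # mean BERTScore F1 over the test set
```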
## Performance Comparison
| Model | Parameters | chrF Score | BERTScore F1 | Efficiency Gain |
|-------|------------|------------|--------------|-----------------|
| **NLLB-350M-EN-KM (Ours)** | 350M | 38.83 | 0.8608 | 42% smaller |
| NLLB-200-Distilled-600M | 600M | 43.88 | 0.8573 | Baseline |
## Limitations and Bias
### Limitations
- **Performance trade-off:** 5.05-point chrF decrease compared to the larger 600M baseline
- **Synthetic training data:** May not capture all real-world linguistic variations
- **Domain dependency:** Performance may vary across different text types
- **Low-resource constraints:** Limited by available English-Khmer parallel data
### Bias Considerations
- Training data generated via translation API may inherit source model biases
- Limited representation of Khmer dialects and regional variations
- Potential gender, cultural, and socioeconomic biases in translation outputs
- Urban vs. rural language usage patterns may not be equally represented
### Ethical Considerations
- Model designed to support Cambodian language preservation and digital inclusion
- Users should validate translations for sensitive or critical applications
- Consider cultural context when deploying in official or educational settings
## Environmental Impact
- **Hardware:** Training performed on single NVIDIA A100-SXM4-80GB
- **Training time:** Approximately 10 hours for full training
- **Energy efficiency:** Distilling from a pretrained teacher is significantly cheaper than training a comparable model from scratch
- **Deployment efficiency:** 42% reduction in computational requirements
## Citation
```bibtex
@misc{nllb350m_en_km_v10_2025,
  title={NLLB-350M-EN-KM-v10: Efficient English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v10}
}
```
## Acknowledgments
This work builds upon Meta's NLLB-200 models and uses the Asian Language Treebank (ALT) corpus for evaluation.
## Model Card Contact
For questions or feedback about this model card: [email protected] |