---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
- production-ready
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v10
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 38.83
    - type: bertscore
      value: 0.8608
pipeline_tag: translation
---
# NLLB-350M-EN-KM-v10
## Model Description
This is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. It is the **research evaluation version**, trained for the full 10 epochs, and it reaches competitive translation quality with 42% fewer parameters than the 600M baseline.
- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from 600M baseline)
## Model Details
### Architecture
- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M (see the verification sketch below)
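The reduced depth and the parameter count can be checked directly against the published checkpoint. A minimal verification sketch; NLLB checkpoints expose these values through the standard M2M100 config fields:
```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("lyfeyvutha/nllb_350M_en_km_v10")

# Depth and width, as reported in the table above
print(model.config.encoder_layers)           # expected: 3
print(model.config.decoder_layers)           # expected: 3
print(model.config.d_model)                  # expected: 1024
print(model.config.encoder_attention_heads)  # expected: 16

# Total parameter count (~350M, dominated by the ~256k-token embedding table)
print(sum(p.numel() for p in model.parameters()))
```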
### Training Procedure
- **Distillation method:** Temperature-scaled knowledge distillation
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5, balancing the distillation and cross-entropy terms (see the loss sketch after this list)
- **Training epochs:** 10 (full training)
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
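The exact training code is not published here, but temperature-scaled knowledge distillation with these settings conventionally combines a hard-label cross-entropy term with a temperature-softened KL term against the teacher's logits. A minimal sketch under that assumption; function and tensor names are illustrative:
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5, pad_token_id=1):
    """Weighted sum of hard-label CE and temperature-scaled KL to the teacher."""
    # Hard-label cross-entropy against the reference translation
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=pad_token_id,
    )
    # Soft-label KL divergence between temperature-softened distributions,
    # rescaled by T^2 as in Hinton et al. (2015)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return lam * kl + (1.0 - lam) * ce
```
With lambda at 0.5 the two terms are weighted equally, so the direction of the weighting convention does not change the objective.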
## Intended Uses
### Direct Use
This model is intended for:
- Production English-to-Khmer translation applications
- Research on efficient neural machine translation
- Cambodian language technology development
- Cultural preservation through digital translation tools
### Downstream Use
- Integration into mobile translation apps
- Website localization services
- Educational language learning platforms
- Government and NGO translation services in Cambodia
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration, forcing Khmer as the target language
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)  # decode the first (only) sequence
print(translation)
```
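For higher throughput, the same setup extends to batched translation; a short sketch (the batching and device handling are assumptions, not part of the published example):
```python
import torch

# Reuse `model`, `tokenizer`, and `generation_config` from the snippet above
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

texts = ["Good morning.", "Where is the market?"]
batch = tokenizer(texts, return_tensors="pt", padding=True).to(device)
outputs = model.generate(**batch, generation_config=generation_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```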
## Training Details
### Training Data
- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic parallel data generated with the DeepSeek API
- **Preprocessing:** Tokenized with the NLLB-200 tokenizer at a maximum length of 128 (see the sketch below)
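A hedged sketch of that preprocessing step, assuming the standard `transformers` `text_target` convention; the column names `en` and `km` are illustrative:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(example):
    # Tokenize source and target together; both sides truncated to 128 tokens
    return tokenizer(
        example["en"],
        text_target=example["km"],
        max_length=128,
        truncation=True,
    )
```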
### Training Hyperparameters
- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 10
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8
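These settings map naturally onto the Hugging Face `Seq2SeqTrainingArguments`; a minimal sketch (the output path, precision, and logging choices are assumptions):
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb_350M_en_km_v10",  # illustrative path
    per_device_train_batch_size=48,
    learning_rate=3e-5,
    optim="adamw_torch",               # AdamW
    lr_scheduler_type="cosine",
    num_train_epochs=10,
    fp16=True,                         # assumption: mixed precision on the A100
    evaluation_strategy="epoch",       # matches the per-epoch losses below
    logging_strategy="epoch",
)
```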
### Training Progress
| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.658600 | 0.674992 |
| 2 | 0.534500 | 0.596366 |
| 3 | 0.484700 | 0.566999 |
| 4 | 0.453800 | 0.549162 |
| 5 | 0.436300 | 0.542330 |
| 6 | 0.432900 | 0.536817 |
| 7 | 0.421000 | 0.534668 |
| 8 | 0.412800 | 0.532001 |
| 9 | 0.417400 | 0.533419 |
| 10 | 0.413200 | 0.531947 |
## Evaluation
### Testing Data
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs from English Wikinews articles.
### Metrics
| Metric | Our Model (350M) | Baseline (600M) | Difference |
|--------|------------------|-----------------|------------|
| chrF Score | 38.83 | 43.88 | -5.05 points |
| BERTScore F1 | 0.8608 | 0.8573 | +0.0035 |
| Parameters | 350M | 600M | -42% |
### Results
- Achieves 88.5% of baseline chrF performance with 42% fewer parameters
- Slightly improves on BERTScore F1 (+0.0035), suggesting semantic adequacy on par with the baseline
- Significant computational efficiency gains for deployment scenarios
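The scores above can be recomputed with the Hugging Face `evaluate` library; a sketch with illustrative placeholder data standing in for the actual ALT predictions and references:
```python
import evaluate

# Illustrative placeholders; in practice these are the model's translations
# of the ALT test set and the corresponding reference translations.
preds = ["សួស្តី តើអ្នកសុខសប្បាយទេ?"]
refs = ["សួស្តី តើអ្នកសុខសប្បាយទេ?"]

chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print(chrf.compute(predictions=preds, references=[[r] for r in refs])["score"])
f1 = bertscore.compute(predictions=preds, references=refs, lang="km")["f1"]
print(sum(f1) / len(f1))  # mean BERTScore F1 over the test set
```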
## Performance Comparison
| Model | Parameters | chrF Score | BERTScore F1 | Efficiency Gain |
|-------|------------|------------|--------------|-----------------|
| **NLLB-350M-EN-KM (Ours)** | 350M | 38.83 | 0.8608 | 42% smaller |
| NLLB-200-Distilled-600M | 600M | 43.88 | 0.8573 | Baseline |
## Limitations and Bias
### Limitations
- **Performance trade-off:** 5.05-point chrF decrease compared to the larger 600M baseline
- **Synthetic training data:** May not capture all real-world linguistic variations
- **Domain dependency:** Performance may vary across different text types
- **Low-resource constraints:** Limited by available English-Khmer parallel data
### Bias Considerations
- Training data generated via translation API may inherit source model biases
- Limited representation of Khmer dialects and regional variations
- Potential gender, cultural, and socioeconomic biases in translation outputs
- Urban vs. rural language usage patterns may not be equally represented
### Ethical Considerations
- Model designed to support Cambodian language preservation and digital inclusion
- Users should validate translations for sensitive or critical applications
- Consider cultural context when deploying in official or educational settings
## Environmental Impact
- **Hardware:** Training performed on single NVIDIA A100-SXM4-80GB
- **Training time:** Approximately 10 hours for full training
- **Energy efficiency:** Distilling from a pretrained teacher is significantly cheaper than training a comparable model from scratch
- **Deployment efficiency:** 42% reduction in computational requirements
## Citation
```bibtex
@misc{nllb350m_en_km_v10_2025,
  title={NLLB-350M-EN-KM-v10: Efficient English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v10}
}
```
## Acknowledgments
This work builds upon Meta's NLLB-200 models and uses the Asian Language Treebank (ALT) corpus for evaluation.
## Model Card Contact
For questions or feedback about this model card: [email protected] |