NLLB-350M-EN-KM-v10

Model Description

This model is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. This release is the research evaluation version, trained for the full 10 epochs, and achieves competitive translation quality with 42% fewer parameters than the 600M baseline.

  • Developed by: Chealyfey Vutha
  • Model type: Sequence-to-sequence transformer for machine translation
  • Language(s): English to Khmer (en → km)
  • License: CC-BY-NC 4.0
  • Base model: facebook/nllb-200-distilled-600M
  • Teacher model: facebook/nllb-200-1.3B
  • Parameters: 350M (42% reduction from 600M baseline)

Model Details

Architecture

  • Encoder layers: 3 (reduced from 12)
  • Decoder layers: 3 (reduced from 12)
  • Hidden size: 1024
  • Attention heads: 16
  • Total parameters: ~350M (see the configuration check below)
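
The layer counts above can be read directly from the released checkpoint's configuration. A minimal sketch, assuming the checkpoint keeps the standard NLLB (M2M-100-style) configuration field names:

from transformers import AutoConfig

# Inspect the distilled checkpoint's architecture
config = AutoConfig.from_pretrained("lyfeyvutha/nllb_350M_en_km_v10")
print("Encoder layers: ", config.encoder_layers)           # expected: 3
print("Decoder layers: ", config.decoder_layers)           # expected: 3
print("Hidden size:    ", config.d_model)                  # expected: 1024
print("Attention heads:", config.encoder_attention_heads)  # expected: 16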

Training Procedure

  • Distillation method: Temperature-scaled knowledge distillation (see the loss sketch after this list)
  • Teacher model: NLLB-200-1.3B
  • Temperature: 5.0
  • Lambda (loss weighting): 0.5
  • Training epochs: 10 (full training)
  • Training data: 316,110 English-Khmer pairs (generated via DeepSeek API)
  • Hardware: NVIDIA A100-SXM4-80GB
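
The sketch below shows one common way the reported temperature (5.0) and lambda (0.5) enter a temperature-scaled distillation objective. The exact loss used for this model is not published, so the split between the hard-label and soft-label terms and the pad token id are assumptions:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5, pad_token_id=1):
    # Hard-label cross-entropy against the reference Khmer tokens
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=pad_token_id,
    )
    # Soft-label term: KL divergence between the temperature-softened
    # teacher and student distributions; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # lam weights the hard-label term, (1 - lam) the distillation term (assumed)
    return lam * ce + (1.0 - lam) * kd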

Intended Uses

Direct Use

This model is intended for:

  • Production English-to-Khmer translation applications
  • Research on efficient neural machine translation
  • Cambodian language technology development
  • Cultural preservation through digital translation tools

Downstream Use

  • Integration into mobile translation apps
  • Website localization services
  • Educational language learning platforms
  • Government and NGO translation services in Cambodia

How to Get Started with the Model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128,
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"],
)

# Set up generation configuration: force Khmer as the target language
# by using its language code as the first decoder token
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id,
)

# Translate a single sentence
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
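
For multiple inputs, the same model and generation settings can be reused with a padded batch; a short continuation of the snippet above (the example sentences are illustrative):

# Translate several sentences in one generate() call
sentences = ["Good morning.", "Where is the nearest market?"]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch, generation_config=generation_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))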

Training Details

Training Data

  • Dataset size: 316,110 English-Khmer sentence pairs
  • Data source: Synthetic parallel data generated via the DeepSeek API
  • Preprocessing: Tokenized with the NLLB-200 tokenizer at a maximum length of 128 (see the sketch below)
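
A minimal preprocessing sketch under these settings; the column names "en" and "km" are hypothetical, since the dataset schema is not published:

from transformers import AutoTokenizer

# Same tokenizer as the base checkpoint, with the en -> km language codes
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(example, max_length=128):
    # text_target tokenizes the Khmer side under the target language code
    return tokenizer(
        example["en"],
        text_target=example["km"],
        max_length=max_length,
        truncation=True,
    )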

Training Hyperparameters

  • Batch size: 48
  • Learning rate: 3e-5
  • Optimizer: AdamW
  • LR scheduler: Cosine
  • Training epochs: 10 (see the training-arguments sketch below)
  • Hardware: NVIDIA A100-SXM4-80GB with CUDA 12.8
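
For reference, these values map onto a Hugging Face Seq2SeqTrainingArguments configuration roughly as follows; the output path and any argument not listed above are illustrative rather than taken from the actual run:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb_350M_en_km_v10",   # hypothetical output path
    per_device_train_batch_size=48,
    learning_rate=3e-5,
    optim="adamw_torch",                # AdamW
    lr_scheduler_type="cosine",
    num_train_epochs=10,
)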

Training Progress

Epoch   Training Loss   Validation Loss
1       0.658600        0.674992
2       0.534500        0.596366
3       0.484700        0.566999
4       0.453800        0.549162
5       0.436300        0.542330
6       0.432900        0.536817
7       0.421000        0.534668
8       0.412800        0.532001
9       0.417400        0.533419
10      0.413200        0.531947

Evaluation

Testing Data

The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs from English Wikinews articles.

Metrics

Metric         Our Model (350M)   Baseline (600M)   Difference
chrF Score     38.83              43.88             -5.05 points
BERTScore F1   0.8608             0.8573            +0.0035
Parameters     350M               600M              -42%

Results

  • Reaches 88.5% of the baseline chrF score with 42% fewer parameters
  • Scores slightly higher on BERTScore F1 (+0.0035), indicating semantic adequacy on par with the baseline (metric computation sketched below)
  • Substantially lower compute and memory requirements in deployment
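
The chrF and BERTScore figures above can be recomputed with sacrebleu and bert-score given the model's outputs and the ALT references. A minimal sketch with placeholder lists; note that lang="km" selects a multilingual encoder in bert-score, which may differ from the scorer behind the reported numbers:

from sacrebleu.metrics import CHRF
from bert_score import score as bertscore

hypotheses = ["..."]   # model translations (placeholders)
references = ["..."]   # ALT reference translations (placeholders)

# Corpus-level chrF (character n-gram F-score)
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, [references]).score)

# BERTScore F1 averaged over the test set
P, R, F1 = bertscore(hypotheses, references, lang="km")
print("BERTScore F1:", F1.mean().item())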

Performance Comparison

Model                     Parameters   chrF Score   BERTScore F1   Efficiency Gain
NLLB-350M-EN-KM (Ours)    350M         38.83        0.8608         42% smaller
NLLB-200-Distilled-600M   600M         43.88        0.8573         Baseline

Limitations and Bias

Limitations

  • Performance trade-off: 5.05-point chrF decrease compared to the larger 600M baseline
  • Synthetic training data: May not capture all real-world linguistic variations
  • Domain dependency: Performance may vary across different text types
  • Low-resource constraints: Limited by available English-Khmer parallel data

Bias Considerations

  • Training data generated via translation API may inherit source model biases
  • Limited representation of Khmer dialects and regional variations
  • Potential gender, cultural, and socioeconomic biases in translation outputs
  • Urban vs. rural language usage patterns may not be equally represented

Ethical Considerations

  • Model designed to support Cambodian language preservation and digital inclusion
  • Users should validate translations for sensitive or critical applications
  • Consider cultural context when deploying in official or educational settings

Environmental Impact

  • Hardware: Training performed on single NVIDIA A100-SXM4-80GB
  • Training time: Approximately 10 hours for full training
  • Energy efficiency: Significantly more efficient than training from scratch
  • Deployment efficiency: 42% reduction in computational requirements

Citation

@misc{nllb350m_en_km_v10_2025,
  title={NLLB-350M-EN-KM-v10: Efficient English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v10}
}

Acknowledgments

This work builds upon Meta's NLLB-200 models and uses the Asian Language Treebank (ALT) corpus for evaluation.

Model Card Contact

For questions or feedback about this model card: [email protected]
