NLLB-350M-EN-KM-v10

Model Description

This model is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. This release is the research evaluation version, trained for the full 10 epochs, and achieves competitive translation quality with 42% fewer parameters than the 600M baseline.

  • Developed by: Chealyfey Vutha
  • Model type: Sequence-to-sequence transformer for machine translation
  • Language(s): English to Khmer (en → km)
  • License: CC-BY-NC 4.0
  • Base model: facebook/nllb-200-distilled-600M
  • Teacher model: facebook/nllb-200-1.3B
  • Parameters: 350M (42% reduction from 600M baseline)

Model Details

Architecture

  • Encoder layers: 3 (reduced from 12)
  • Decoder layers: 3 (reduced from 12)
  • Hidden size: 1024
  • Attention heads: 16
  • Total parameters: ~350M (see the configuration check below)
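
The layer counts above can be read directly from the released checkpoint's configuration. A minimal sketch, assuming the checkpoint keeps the standard NLLB (M2M-100-style) configuration field names:

from transformers import AutoConfig

# Inspect the distilled checkpoint's architecture
config = AutoConfig.from_pretrained("lyfeyvutha/nllb_350M_en_km_v10")
print("Encoder layers: ", config.encoder_layers)           # expected: 3
print("Decoder layers: ", config.decoder_layers)           # expected: 3
print("Hidden size:    ", config.d_model)                  # expected: 1024
print("Attention heads:", config.encoder_attention_heads)  # expected: 16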

Training Procedure

  • Distillation method: Temperature-scaled knowledge distillation (see the loss sketch after this list)
  • Teacher model: NLLB-200-1.3B
  • Temperature: 5.0
  • Lambda (loss weighting): 0.5
  • Training epochs: 10 (full training)
  • Training data: 316,110 English-Khmer pairs (generated via DeepSeek API)
  • Hardware: NVIDIA A100-SXM4-80GB
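
The sketch below shows one common way the reported temperature (5.0) and lambda (0.5) enter a temperature-scaled distillation objective. The exact loss used for this model is not published, so the split between the hard-label and soft-label terms and the pad token id are assumptions:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5, pad_token_id=1):
    # Hard-label cross-entropy against the reference Khmer tokens
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=pad_token_id,
    )
    # Soft-label term: KL divergence between the temperature-softened
    # teacher and student distributions; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # lam weights the hard-label term, (1 - lam) the distillation term (assumed)
    return lam * ce + (1.0 - lam) * kd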

Intended Uses

Direct Use

This model is intended for:

  • Production English-to-Khmer translation applications
  • Research on efficient neural machine translation
  • Cambodian language technology development
  • Cultural preservation through digital translation tools

Downstream Use

  • Integration into mobile translation apps
  • Website localization services
  • Educational language learning platforms
  • Government and NGO translation services in Cambodia

How to Get Started with the Model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128,
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"],
)

# Set up generation configuration: force Khmer as the target language
# by using its language code as the first decoder token
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id,
)

# Translate a single sentence
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
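
For multiple inputs, the same model and generation settings can be reused with a padded batch; a short continuation of the snippet above (the example sentences are illustrative):

# Translate several sentences in one generate() call
sentences = ["Good morning.", "Where is the nearest market?"]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch, generation_config=generation_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))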

Training Details

Training Data

  • Dataset size: 316,110 English-Khmer sentence pairs
  • Data source: Synthetic parallel data generated via the DeepSeek API
  • Preprocessing: Tokenized with the NLLB-200 tokenizer at a maximum length of 128 (see the sketch below)
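
A minimal preprocessing sketch under these settings; the column names "en" and "km" are hypothetical, since the dataset schema is not published:

from transformers import AutoTokenizer

# Same tokenizer as the base checkpoint, with the en -> km language codes
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(example, max_length=128):
    # text_target tokenizes the Khmer side under the target language code
    return tokenizer(
        example["en"],
        text_target=example["km"],
        max_length=max_length,
        truncation=True,
    )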

Training Hyperparameters

  • Batch size: 48
  • Learning rate: 3e-5
  • Optimizer: AdamW
  • LR scheduler: Cosine
  • Training epochs: 10 (see the training-arguments sketch below)
  • Hardware: NVIDIA A100-SXM4-80GB with CUDA 12.8
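
For reference, these values map onto a Hugging Face Seq2SeqTrainingArguments configuration roughly as follows; the output path and any argument not listed above are illustrative rather than taken from the actual run:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb_350M_en_km_v10",   # hypothetical output path
    per_device_train_batch_size=48,
    learning_rate=3e-5,
    optim="adamw_torch",                # AdamW
    lr_scheduler_type="cosine",
    num_train_epochs=10,
)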

Training Progress

Epoch   Training Loss   Validation Loss
1       0.658600        0.674992
2       0.534500        0.596366
3       0.484700        0.566999
4       0.453800        0.549162
5       0.436300        0.542330
6       0.432900        0.536817
7       0.421000        0.534668
8       0.412800        0.532001
9       0.417400        0.533419
10      0.413200        0.531947

Evaluation

Testing Data

The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs from English Wikinews articles.

Metrics

Metric         Our Model (350M)   Baseline (600M)   Difference
chrF Score     38.83              43.88             -5.05 points
BERTScore F1   0.8608             0.8573            +0.0035
Parameters     350M               600M              -42%

Results

  • Reaches 88.5% of the baseline chrF score with 42% fewer parameters
  • Scores slightly higher on BERTScore F1 (+0.0035), indicating semantic adequacy on par with the baseline (metric computation sketched below)
  • Substantially lower compute and memory requirements in deployment
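
The chrF and BERTScore figures above can be recomputed with sacrebleu and bert-score given the model's outputs and the ALT references. A minimal sketch with placeholder lists; note that lang="km" selects a multilingual encoder in bert-score, which may differ from the scorer behind the reported numbers:

from sacrebleu.metrics import CHRF
from bert_score import score as bertscore

hypotheses = ["..."]   # model translations (placeholders)
references = ["..."]   # ALT reference translations (placeholders)

# Corpus-level chrF (character n-gram F-score)
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, [references]).score)

# BERTScore F1 averaged over the test set
P, R, F1 = bertscore(hypotheses, references, lang="km")
print("BERTScore F1:", F1.mean().item())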

Performance Comparison

Model                     Parameters   chrF Score   BERTScore F1   Efficiency Gain
NLLB-350M-EN-KM (Ours)    350M         38.83        0.8608         42% smaller
NLLB-200-Distilled-600M   600M         43.88        0.8573         Baseline

Limitations and Bias

Limitations

  • Performance trade-off: 5.05-point chrF decrease compared to the larger 600M baseline
  • Synthetic training data: May not capture all real-world linguistic variations
  • Domain dependency: Performance may vary across different text types
  • Low-resource constraints: Limited by available English-Khmer parallel data

Bias Considerations

  • Training data generated via translation API may inherit source model biases
  • Limited representation of Khmer dialects and regional variations
  • Potential gender, cultural, and socioeconomic biases in translation outputs
  • Urban vs. rural language usage patterns may not be equally represented

Ethical Considerations

  • Model designed to support Cambodian language preservation and digital inclusion
  • Users should validate translations for sensitive or critical applications
  • Consider cultural context when deploying in official or educational settings

Environmental Impact

  • Hardware: Training performed on single NVIDIA A100-SXM4-80GB
  • Training time: Approximately 10 hours for full training
  • Energy efficiency: Significantly more efficient than training from scratch
  • Deployment efficiency: 42% reduction in computational requirements

Citation

@misc{nllb350m_en_km_v10_2025,
  title={NLLB-350M-EN-KM-v10: Efficient English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v10}
}

Acknowledgments

This work builds upon Meta's NLLB-200 models and uses the Asian Language Treebank (ALT) corpus for evaluation.

Model Card Contact

For questions or feedback about this model card: [email protected]
