---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
- production-ready
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v10
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 38.83
    - type: bertscore
      value: 0.8608
pipeline_tag: translation
---

# NLLB-350M-EN-KM-v10

## Model Description

This is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. It is the **research evaluation version**, trained for the full 10 epochs, and reaches competitive translation quality with 42% fewer parameters than the 600M baseline.

- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from 600M baseline)

## Model Details

### Architecture
- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M
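
The layer reduction can be checked directly from the model's configuration. A minimal sketch, assuming the standard Hugging Face config fields for NLLB (M2M100-style) models:

```python
from transformers import AutoConfig

# Inspect the distilled architecture; field names follow the standard
# NLLB/M2M100 configuration in Hugging Face Transformers.
config = AutoConfig.from_pretrained("lyfeyvutha/nllb_350M_en_km_v10")
print(config.encoder_layers)           # expected: 3
print(config.decoder_layers)           # expected: 3
print(config.d_model)                  # expected: 1024
print(config.encoder_attention_heads)  # expected: 16
```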

### Training Procedure
- **Distillation method:** Temperature-scaled knowledge distillation (loss sketched below)
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5
- **Training epochs:** 10 (full training)
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
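
The training code is not reproduced in this card; the following is a minimal sketch of a temperature-scaled distillation loss consistent with the settings above (T = 5.0, λ = 0.5), under the common assumption that λ weights the hard-label cross-entropy against the softened KL term:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5):
    # Hard-label term: cross-entropy against the reference tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # assumption: padding positions masked with -100
    )
    # Soft-label term: KL divergence between temperature-softened student
    # and teacher distributions, rescaled by T^2 as is standard.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return lam * ce + (1.0 - lam) * kd
```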

## Intended Uses

### Direct Use
This model is intended for:
- Production English-to-Khmer translation applications
- Research on efficient neural machine translation
- Cambodian language technology development
- Cultural preservation through digital translation tools

### Downstream Use
- Integration into mobile translation apps
- Website localization services
- Educational language learning platforms
- Government and NGO translation services in Cambodia

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128,
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"],
)

# Set up generation: force the decoder to start with the Khmer language token
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id,
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
# generate() returns a batch of sequences; decode the first one
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
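
For batch translation, pass a list of sentences to the tokenizer with `padding=True` and decode all outputs at once with `tokenizer.batch_decode(outputs, skip_special_tokens=True)`.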

## Training Details

### Training Data
- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic data generated using DeepSeek translation API
- **Preprocessing:** Tokenized using NLLB-200 tokenizer with max length 128
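
As an illustration, preprocessing a single pair with the NLLB tokenizer could look like the sketch below (the Khmer string is illustrative, and `text_target` requires a recent Transformers version):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

# Tokenize one English-Khmer pair, truncating both sides to 128 tokens.
features = tokenizer(
    "Hello, how are you?",
    text_target="សួស្តី តើអ្នកសុខសប្បាយទេ?",
    max_length=128,
    truncation=True,
)
print(features["input_ids"])  # source token ids (English side)
print(features["labels"])     # target token ids (Khmer side)
```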

### Training Hyperparameters
- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 10
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8
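
For reference, these hyperparameters map onto the standard Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a hypothetical reconstruction: `output_dir` and the mixed-precision flag are assumptions, not details from the original run.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb_350M_en_km_v10",  # hypothetical output path
    per_device_train_batch_size=48,
    learning_rate=3e-5,
    optim="adamw_torch",               # AdamW
    lr_scheduler_type="cosine",
    num_train_epochs=10,
    fp16=True,                         # assumption: mixed precision on the A100
)
```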

### Training Progress
| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.658600 | 0.674992 |
| 2 | 0.534500 | 0.596366 |
| 3 | 0.484700 | 0.566999 |
| 4 | 0.453800 | 0.549162 |
| 5 | 0.436300 | 0.542330 |
| 6 | 0.432900 | 0.536817 |
| 7 | 0.421000 | 0.534668 |
| 8 | 0.412800 | 0.532001 |
| 9 | 0.417400 | 0.533419 |
| 10 | 0.413200 | 0.531947 |

## Evaluation

### Testing Data
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs from English Wikinews articles.

### Metrics
| Metric | Our Model (350M) | Baseline (600M) | Difference |
|--------|------------------|-----------------|------------|
| chrF Score | 38.83 | 43.88 | -5.05 points |
| BERTScore F1 | 0.8608 | 0.8573 | +0.0035 |
| Parameters | 350M | 600M | -42% |

### Results
- Achieves 88.5% of the baseline chrF score with 42% fewer parameters
- Slightly exceeds the baseline BERTScore F1 (+0.0035), suggesting semantic adequacy is preserved
- The smaller footprint reduces memory use and inference latency in deployment
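
The scores above can be recomputed with the Hugging Face `evaluate` library along the lines of the sketch below (with `lang="km"`, BERTScore falls back to a multilingual BERT, which may differ from the exact scoring model used here):

```python
import evaluate

predictions = ["..."]  # model outputs on the ALT test set
references = ["..."]   # ALT reference translations, aligned with predictions

# chrF expects a list of reference lists per prediction.
chrf = evaluate.load("chrf")
print(chrf.compute(predictions=predictions,
                   references=[[r] for r in references])["score"])

# BERTScore returns per-sentence scores; average F1 over the test set.
bertscore = evaluate.load("bertscore")
scores = bertscore.compute(predictions=predictions,
                           references=references, lang="km")
print(sum(scores["f1"]) / len(scores["f1"]))
```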

## Performance Comparison

| Model | Parameters | chrF Score | BERTScore F1 | Efficiency Gain |
|-------|------------|------------|--------------|-----------------|
| **NLLB-350M-EN-KM (Ours)** | 350M | 38.83 | 0.8608 | 42% smaller |
| NLLB-200-Distilled-600M | 600M | 43.88 | 0.8573 | Baseline |

## Limitations and Bias

### Limitations
- **Performance trade-off:** 5-point chrF decrease compared to larger baseline
- **Synthetic training data:** May not capture all real-world linguistic variations
- **Domain dependency:** Performance may vary across different text types
- **Low-resource constraints:** Limited by available English-Khmer parallel data

### Bias Considerations
- Training data generated via translation API may inherit source model biases
- Limited representation of Khmer dialects and regional variations
- Potential gender, cultural, and socioeconomic biases in translation outputs
- Urban vs. rural language usage patterns may not be equally represented

### Ethical Considerations
- Model designed to support Cambodian language preservation and digital inclusion
- Users should validate translations for sensitive or critical applications
- Consider cultural context when deploying in official or educational settings

## Environmental Impact

- **Hardware:** Training performed on single NVIDIA A100-SXM4-80GB
- **Training time:** Approximately 10 hours for full training
- **Energy efficiency:** Distilling from a pretrained teacher is substantially cheaper than training a model of comparable quality from scratch
- **Deployment efficiency:** 42% reduction in computational requirements

## Citation

```bibtex
@misc{nllb350m_en_km_v10_2025,
  title={NLLB-350M-EN-KM-v10: Efficient English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v10}
}
```

## Acknowledgments

This work builds upon Meta's NLLB-200 models and uses the Asian Language Treebank (ALT) corpus for evaluation.

## Model Card Contact

For questions or feedback about this model card: [email protected]