---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v1
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 21.3502
    - type: bertscore
      value: 0.8983
pipeline_tag: translation
new_version: lyfeyvutha/nllb_350M_en_km_v10
---
# NLLB-350M-EN-KM-v1
## Model Description
This model is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200-1.3B. It is the **proof-of-concept version** (trained for a single epoch), demonstrating the feasibility of the distillation approach.
- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from 600M baseline)
## Model Details
### Architecture
- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M
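As a sanity check, the reduced layout can be read off the checkpoint's config (a minimal sketch; attribute names follow the `M2M100Config` that NLLB checkpoints use):

```python
from transformers import AutoConfig

# Values should match the architecture list above
config = AutoConfig.from_pretrained("lyfeyvutha/nllb_350M_en_km_v1")
print(config.encoder_layers)           # expected: 3
print(config.decoder_layers)           # expected: 3
print(config.d_model)                  # expected: 1024
print(config.encoder_attention_heads)  # expected: 16
```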
### Training Procedure
- **Distillation method:** Temperature-scaled knowledge distillation (see the loss sketch after this list)
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5
- **Training epochs:** 1 (proof of concept)
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
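For readers who want the mechanics, below is a minimal sketch of a temperature-scaled distillation loss with the settings above (T = 5.0, λ = 0.5). It is illustrative, not the exact training code; label padding is assumed to follow the Hugging Face `-100` convention.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5):
    """Sketch of a temperature-scaled KD loss:
    lam * CE(student, labels) + (1 - lam) * T^2 * KL(teacher_T || student_T)."""
    vocab = student_logits.size(-1)

    # Hard-label cross-entropy against the reference Khmer tokens
    # (-100 marks padded label positions, the usual HF convention)
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         labels.view(-1), ignore_index=-100)

    # KL between temperature-softened teacher and student distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # lam = 0.5 weights the hard-label and distillation terms equally
    return lam * ce + (1.0 - lam) * kd
```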
## Intended Uses
### Direct Use
This model is intended for:
- English-to-Khmer translation tasks
- Research on knowledge distillation for low-resource languages
- Proof-of-concept demonstrations
- Computational efficiency research
### Downstream Use
- Integration into translation applications
- Fine-tuning for domain-specific translation
- Baseline for further model compression research
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v1",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration: NLLB models need the target-language
# token forced as the first decoder token to select the output language
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
## Training Details
### Training Data
- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic data generated using DeepSeek translation API
- **Preprocessing:** Tokenized using NLLB-200 tokenizer with max length 128
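A minimal sketch of that preprocessing step, assuming the raw pairs are records with `en` and `km` fields (the card does not specify the data schema):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(pair):
    # `pair` is a hypothetical {"en": ..., "km": ...} record
    return tokenizer(
        pair["en"],
        text_target=pair["km"],  # tokenizes the Khmer side as labels
        max_length=128,
        truncation=True,
    )
```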
### Training Hyperparameters
- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 1
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8
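In code, these settings translate into roughly the following setup (a sketch, not the original script; the warmup length and the student's initialization are not stated in this card, so the released checkpoint stands in for the student):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, get_cosine_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("lyfeyvutha/nllb_350M_en_km_v1")

# 316,110 pairs at batch size 48, one epoch
num_training_steps = (316_110 + 47) // 48

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # warmup is not specified in the card
    num_training_steps=num_training_steps,
)
```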
## Evaluation
### Testing Data
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs.
### Metrics
| Metric | Value |
|--------|-------|
| chrF Score | 21.3502 |
| BERTScore F1 | 0.8983 |
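Scores of this kind can be computed with the Hugging Face `evaluate` library along these lines (a sketch; the exact evaluation script and ALT split handling are not included in this card):

```python
import evaluate

chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

predictions = ["..."]  # decoded model outputs on the ALT test set
references = ["..."]   # gold Khmer translations

# chrF expects a list of reference lists per prediction
print(chrf.compute(predictions=predictions,
                   references=[[r] for r in references])["score"])

# BERTScore returns per-sentence F1; report the mean
bs = bertscore.compute(predictions=predictions,
                       references=references, lang="km")
print(sum(bs["f1"]) / len(bs["f1"]))
```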
### Results
This proof-of-concept model demonstrates that, even after a single epoch of distillation, the student reaches usable translation quality with roughly 42% fewer parameters than the 600M baseline.
## Limitations and Bias
### Limitations
- **Limited training:** Only 1 epoch of training; performance may improve with extended training
- **Synthetic data:** Training data generated via API may not capture all linguistic nuances
- **Domain specificity:** Performance may vary across different text domains
- **Resource constraints:** Optimized for efficiency over maximum quality
### Bias Considerations
- Training data generated via translation API may inherit biases from the source model
- Limited evaluation on diverse Khmer dialects and registers
- Potential cultural and contextual biases in translation choices
## Citation
```bibtex
@misc{nllb350m_en_km_v1_2025,
  title={NLLB-350M-EN-KM-v1: Proof of Concept English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v1}
}
```
## Model Card Contact
For questions or feedback about this model card: [email protected] |