---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v1
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 21.3502
    - type: bertscore
      value: 0.8983
pipeline_tag: translation
new_version: lyfeyvutha/nllb_350M_en_km_v10
---
# NLLB-350M-EN-KM-v1
## Model Description
This is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. It is the **proof-of-concept version** (1 training epoch), demonstrating the feasibility of the distillation approach.
- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from 600M baseline)
## Model Details
### Architecture
- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M
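
For reference, one way to instantiate a student with these dimensions is to shrink the depth of the base model's config, as in the sketch below. The author's exact construction (for example, which teacher layers seed the 3 student layers) is not documented here, so treat this as an illustration rather than the actual build script.

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Shrink the 600M distilled base config to the reported depth;
# hidden size (1024) and attention heads (16) are inherited unchanged.
config = AutoConfig.from_pretrained("facebook/nllb-200-distilled-600M")
config.encoder_layers = 3  # reduced from 12
config.decoder_layers = 3  # reduced from 12

student = AutoModelForSeq2SeqLM.from_config(config)
print(f"{sum(p.numel() for p in student.parameters()) / 1e6:.0f}M parameters")
```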
### Training Procedure
- **Distillation method:** Temperature-scaled knowledge distillation (see the loss sketch after this list)
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5
- **Training epochs:** 1 (proof of concept)
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
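
The function below is a minimal sketch of temperature-scaled knowledge distillation with the settings above (temperature 5.0, lambda 0.5). The exact loss formulation, masking, and reduction used in training are not published, so the details here are assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5, ignore_index=-100):
    """Blend hard-label cross-entropy with a softened teacher KL term."""
    vocab = student_logits.size(-1)
    # Hard-label term: cross-entropy against the reference Khmer tokens
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1),
                         ignore_index=ignore_index)
    # Soft-label term: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # lambda = 0.5 weights the two objectives equally
    return (1.0 - lam) * ce + lam * kd
```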
## Intended Uses
### Direct Use
This model is intended for:
- English-to-Khmer translation tasks
- Research on knowledge distillation for low-resource languages
- Proof-of-concept demonstrations
- Computational efficiency research
### Downstream Use
- Integration into translation applications
- Fine-tuning for domain-specific translation
- Baseline for further model compression research
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
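A note on the setup: `forced_bos_token_id` pins the first generated token to the `khm_Khmr` language code, which is how NLLB-family models select the output language. The tokenizer is loaded from the base `facebook/nllb-200-distilled-600M` checkpoint, whose vocabulary the distilled student shares.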
## Training Details
### Training Data
- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic data generated using DeepSeek translation API
- **Preprocessing:** Tokenized using NLLB-200 tokenizer with max length 128
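
A minimal preprocessing sketch consistent with the description above (NLLB-200 tokenizer, 128-token limit). The column names and any filtering steps are assumptions, not the author's actual pipeline.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(example):
    # Hypothetical column names "en" / "km"; truncate both sides at 128 tokens
    return tokenizer(
        example["en"],
        text_target=example["km"],
        max_length=128,
        truncation=True,
    )
```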
### Training Hyperparameters
- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 1
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8
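
These settings map onto Hugging Face `Seq2SeqTrainingArguments` roughly as sketched below. Values not listed above (warmup, weight decay, precision, gradient accumulation) are unreported, and the distillation loss itself would require a custom trainer, so this is only illustrative.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb_350M_en_km_v1",  # hypothetical output path
    per_device_train_batch_size=48,
    learning_rate=3e-5,
    optim="adamw_torch",              # AdamW optimizer
    lr_scheduler_type="cosine",
    num_train_epochs=1,
)
```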
## Evaluation
### Testing Data
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs.
### Metrics
| Metric | Value |
|--------|-------|
| chrF Score | 21.3502 |
| BERTScore F1 | 0.8983 |
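
The scores can be reproduced in spirit with the `evaluate` library as sketched below; the exact chrF settings and BERTScore backbone behind the reported numbers are not stated, so the defaults here are assumptions.

```python
import evaluate

chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

predictions = ["..."]  # model outputs (Khmer), elided
references = ["..."]   # ALT reference translations, elided

print(chrf.compute(predictions=predictions, references=[[r] for r in references]))
print(bertscore.compute(predictions=predictions, references=references, lang="km"))
```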
### Results
This proof-of-concept model demonstrates that, after a single epoch of distillation, the 350M-parameter student already reaches chrF 21.35 and BERTScore F1 0.90 on ALT with roughly 42% fewer parameters than the 600M base model.
## Limitations and Bias
### Limitations
- **Limited training:** Only 1 epoch of training; performance may improve with extended training
- **Synthetic data:** Training data generated via API may not capture all linguistic nuances
- **Domain specificity:** Performance may vary across different text domains
- **Resource constraints:** Optimized for efficiency over maximum quality
### Bias Considerations
- Training data generated via translation API may inherit biases from the source model
- Limited evaluation on diverse Khmer dialects and registers
- Potential cultural and contextual biases in translation choices
## Citation
```bibtex
@misc{nllb350m_en_km_v1_2025,
  title={NLLB-350M-EN-KM-v1: Proof of Concept English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v1}
}
```
## Model Card Contact
For questions or feedback about this model card: [email protected]