|
---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v1
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 21.3502
    - type: bertscore
      value: 0.8983
pipeline_tag: translation
new_version: lyfeyvutha/nllb_350M_en_km_v10
---
|
|
|
# NLLB-350M-EN-KM-v1 |
|
|
|
## Model Description |
|
|
|
This is a compact English-to-Khmer neural machine translation model created through knowledge distillation from the NLLB-200-1.3B teacher. It is the **proof-of-concept version** (1 training epoch), demonstrating the feasibility of the distillation approach.
|
|
|
- **Developed by:** Chealyfey Vutha |
|
- **Model type:** Sequence-to-sequence transformer for machine translation |
|
- **Language(s):** English to Khmer (en → km) |
|
- **License:** CC-BY-NC 4.0 |
|
- **Base model:** facebook/nllb-200-distilled-600M |
|
- **Teacher model:** facebook/nllb-200-1.3B |
|
- **Parameters:** 350M (a 42% reduction from the 600M baseline)
|
|
|
## Model Details |
|
|
|
### Architecture |
|
- **Encoder layers:** 3 (reduced from 12) |
|
- **Decoder layers:** 3 (reduced from 12) |
|
- **Hidden size:** 1024 |
|
- **Attention heads:** 16 |
|
- **Total parameters:** ~350M |
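
The card does not describe how the student weights were initialized, but a minimal sketch of instantiating a 3-layer-encoder/3-layer-decoder student from the baseline configuration could look like this (random initialization shown for illustration; in practice the student is typically seeded with selected baseline or teacher layers):

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Start from the 600M baseline configuration and reduce the depth.
config = AutoConfig.from_pretrained("facebook/nllb-200-distilled-600M")
config.encoder_layers = 3
config.decoder_layers = 3

# Randomly initialized student with the reduced depth (hidden size 1024 and
# 16 attention heads are inherited from the baseline configuration).
student = AutoModelForSeq2SeqLM.from_config(config)
print(f"{sum(p.numel() for p in student.parameters()) / 1e6:.0f}M parameters")
```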
|
|
|
### Training Procedure |
|
- **Distillation method:** Temperature-scaled knowledge distillation (see the loss sketch after this list)
|
- **Teacher model:** NLLB-200-1.3B |
|
- **Temperature:** 5.0 |
|
- **Lambda (loss weighting):** 0.5 |
|
- **Training epochs:** 1 (proof of concept) |
|
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API) |
|
- **Hardware:** NVIDIA A100-SXM4-80GB |
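
A minimal sketch of the distillation objective described above, assuming a standard combination of hard-label cross-entropy and temperature-scaled KL divergence to the teacher (padding handling and reduction details are assumptions, not taken from the actual training code):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=5.0, lam=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    vocab_size = student_logits.size(-1)

    # Cross-entropy against the reference translations (padding labeled -100).
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
    )

    # KL divergence between temperature-softened student and teacher distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # With lambda = 0.5 the two terms are weighted equally, as on this card.
    return lam * ce + (1.0 - lam) * kd
```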
|
|
|
## Intended Uses |
|
|
|
### Direct Use |
|
This model is intended for: |
|
- English-to-Khmer translation tasks |
|
- Research on knowledge distillation for low-resource languages |
|
- Proof-of-concept demonstrations |
|
- Computational efficiency research |
|
|
|
### Downstream Use |
|
- Integration into translation applications |
|
- Fine-tuning for domain-specific translation |
|
- Baseline for further model compression research |
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v1",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration: force Khmer as the target language
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- **Dataset size:** 316,110 English-Khmer sentence pairs |
|
- **Data source:** Synthetic parallel data generated with the DeepSeek API
|
- **Preprocessing:** Tokenized with the NLLB-200 tokenizer at a maximum length of 128 (see the sketch below)
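
A minimal sketch of this preprocessing step, assuming the standard `text_target` pattern for preparing seq2seq labels (the `en`/`km` field names are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="khm_Khmr"
)

def preprocess(example):
    # Tokenize the English source and the Khmer target in one call;
    # the target ids are returned under the `labels` key.
    return tokenizer(
        example["en"],
        text_target=example["km"],
        max_length=128,
        truncation=True,
    )

print(preprocess({"en": "Hello, how are you?", "km": "សួស្តី តើអ្នកសុខសប្បាយទេ?"}))
```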
|
|
|
### Training Hyperparameters |
|
- **Batch size:** 48 |
|
- **Learning rate:** 3e-5 |
|
- **Optimizer:** AdamW |
|
- **LR scheduler:** Cosine |
|
- **Training epochs:** 1 |
|
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8 |
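
A minimal sketch of the optimizer and schedule listed above, using the standard `transformers` cosine scheduler helper (the warmup step count is illustrative and not stated on this card):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, get_cosine_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("lyfeyvutha/nllb_350M_en_km_v1")

# One epoch over 316,110 pairs at batch size 48.
num_training_steps = 316_110 // 48
num_warmup_steps = 500  # illustrative assumption

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)
```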
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs. |
|
|
|
### Metrics |
|
| Metric | Value |
|--------|-------|
| chrF Score | 21.3502 |
| BERTScore F1 | 0.8983 |
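
For context, a minimal sketch of computing these metrics with the `evaluate` library (the BERTScore backbone is an assumption; the card does not state which model was used for scoring):

```python
import evaluate

chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

predictions = ["<model translation 1>", "<model translation 2>"]  # Khmer hypotheses
references = ["<ALT reference 1>", "<ALT reference 2>"]           # Khmer references

chrf_score = chrf.compute(predictions=predictions, references=[[r] for r in references])
bert_f1 = bertscore.compute(
    predictions=predictions,
    references=references,
    model_type="bert-base-multilingual-cased",  # assumed backbone
)["f1"]

print(chrf_score["score"], sum(bert_f1) / len(bert_f1))
```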
|
|
|
### Results |
|
This proof-of-concept model demonstrates that knowledge distillation can achieve reasonable translation quality with a significantly smaller parameter count (350M vs. the 600M baseline).
|
|
|
## Limitations and Bias |
|
|
|
### Limitations |
|
- **Limited training:** Only 1 epoch of training; performance may improve with extended training |
|
- **Synthetic data:** Training data generated via API may not capture all linguistic nuances |
|
- **Domain specificity:** Performance may vary across different text domains |
|
- **Resource constraints:** Optimized for efficiency over maximum quality |
|
|
|
### Bias Considerations |
|
- Training data generated via translation API may inherit biases from the source model |
|
- Limited evaluation on diverse Khmer dialects and registers |
|
- Potential cultural and contextual biases in translation choices |
|
|
|
## Citation |
|
|
|
```bibtex
@misc{nllb350m_en_km_v1_2025,
  title={NLLB-350M-EN-KM-v1: Proof of Concept English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v1}
}
```
|
|
|
## Model Card Contact |
|
|
|
For questions or feedback about this model card: [email protected] |