---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v1
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 21.3502
    - type: bertscore
      value: 0.8983
pipeline_tag: translation
new_version: lyfeyvutha/nllb_350M_en_km_v10
---
# NLLB-350M-EN-KM-v1
## Model Description
This model is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200-1.3B. It is the **proof-of-concept version** (trained for a single epoch), demonstrating the feasibility of the distillation approach.
- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from 600M baseline)
## Model Details
### Architecture
- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M
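As a sanity check, the reduced layout can be read off the checkpoint's config (a minimal sketch; attribute names follow the `M2M100Config` that NLLB checkpoints use):

```python
from transformers import AutoConfig

# Values should match the architecture list above
config = AutoConfig.from_pretrained("lyfeyvutha/nllb_350M_en_km_v1")
print(config.encoder_layers)           # expected: 3
print(config.decoder_layers)           # expected: 3
print(config.d_model)                  # expected: 1024
print(config.encoder_attention_heads)  # expected: 16
```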
### Training Procedure
- **Distillation method:** Temperature-scaled knowledge distillation (see the loss sketch after this list)
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5
- **Training epochs:** 1 (proof of concept)
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
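For readers who want the mechanics, below is a minimal sketch of a temperature-scaled distillation loss with the settings above (T = 5.0, λ = 0.5). It is illustrative, not the exact training code; label padding is assumed to follow the Hugging Face `-100` convention.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, lam=0.5):
    """Sketch of a temperature-scaled KD loss:
    lam * CE(student, labels) + (1 - lam) * T^2 * KL(teacher_T || student_T)."""
    vocab = student_logits.size(-1)

    # Hard-label cross-entropy against the reference Khmer tokens
    # (-100 marks padded label positions, the usual HF convention)
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         labels.view(-1), ignore_index=-100)

    # KL between temperature-softened teacher and student distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # lam = 0.5 weights the hard-label and distillation terms equally
    return lam * ce + (1.0 - lam) * kd
```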
## Intended Uses
### Direct Use
This model is intended for:
- English-to-Khmer translation tasks
- Research on knowledge distillation for low-resource languages
- Proof-of-concept demonstrations
- Computational efficiency research
### Downstream Use
- Integration into translation applications
- Fine-tuning for domain-specific translation
- Baseline for further model compression research
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v1",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration: NLLB models need the target-language
# token forced as the first decoder token to select the output language
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
## Training Details
### Training Data
- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic data generated using DeepSeek translation API
- **Preprocessing:** Tokenized using NLLB-200 tokenizer with max length 128
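A minimal sketch of that preprocessing step, assuming the raw pairs are records with `en` and `km` fields (the card does not specify the data schema):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(pair):
    # `pair` is a hypothetical {"en": ..., "km": ...} record
    return tokenizer(
        pair["en"],
        text_target=pair["km"],  # tokenizes the Khmer side as labels
        max_length=128,
        truncation=True,
    )
```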
### Training Hyperparameters
- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 1
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8
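In code, these settings translate into roughly the following setup (a sketch, not the original script; the warmup length and the student's initialization are not stated in this card, so the released checkpoint stands in for the student):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, get_cosine_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("lyfeyvutha/nllb_350M_en_km_v1")

# 316,110 pairs at batch size 48, one epoch
num_training_steps = (316_110 + 47) // 48

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # warmup is not specified in the card
    num_training_steps=num_training_steps,
)
```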
## Evaluation
### Testing Data
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs.
### Metrics
| Metric | Value |
|--------|-------|
| chrF Score | 21.3502 |
| BERTScore F1 | 0.8983 |
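Scores of this kind can be computed with the Hugging Face `evaluate` library along these lines (a sketch; the exact evaluation script and ALT split handling are not included in this card):

```python
import evaluate

chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

predictions = ["..."]  # decoded model outputs on the ALT test set
references = ["..."]   # gold Khmer translations

# chrF expects a list of reference lists per prediction
print(chrf.compute(predictions=predictions,
                   references=[[r] for r in references])["score"])

# BERTScore returns per-sentence F1; report the mean
bs = bertscore.compute(predictions=predictions,
                       references=references, lang="km")
print(sum(bs["f1"]) / len(bs["f1"]))
```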
### Results
This proof-of-concept model demonstrates that, even after a single epoch of distillation, the student reaches usable translation quality with roughly 42% fewer parameters than the 600M baseline.
## Limitations and Bias
### Limitations
- **Limited training:** Only 1 epoch of training; performance may improve with extended training
- **Synthetic data:** Training data generated via API may not capture all linguistic nuances
- **Domain specificity:** Performance may vary across different text domains
- **Resource constraints:** Optimized for efficiency over maximum quality
### Bias Considerations
- Training data generated via translation API may inherit biases from the source model
- Limited evaluation on diverse Khmer dialects and registers
- Potential cultural and contextual biases in translation choices
## Citation
```bibtex
@misc{nllb350m_en_km_v1_2025,
  title={NLLB-350M-EN-KM-v1: Proof of Concept English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v1}
}
```
## Model Card Contact
For questions or feedback about this model card: [email protected] |