---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v1
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 21.3502
    - type: bertscore
      value: 0.8983
pipeline_tag: translation
new_version: lyfeyvutha/nllb_350M_en_km_v10
---

# NLLB-350M-EN-KM-v1

## Model Description

This model is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. This is the **proof-of-concept version** (1 epoch) demonstrating the feasibility of the distillation approach.

- **Developed by:** Chealyfey Vutha
- **Model type:** Sequence-to-sequence transformer for machine translation
- **Language(s):** English to Khmer (en → km)
- **License:** CC-BY-NC 4.0
- **Base model:** facebook/nllb-200-distilled-600M
- **Teacher model:** facebook/nllb-200-1.3B
- **Parameters:** 350M (42% reduction from the 600M baseline)

## Model Details

### Architecture

- **Encoder layers:** 3 (reduced from 12)
- **Decoder layers:** 3 (reduced from 12)
- **Hidden size:** 1024
- **Attention heads:** 16
- **Total parameters:** ~350M (see the parameter-count check below)

### Training Procedure

- **Distillation method:** Temperature-scaled knowledge distillation (sketched below)
- **Teacher model:** NLLB-200-1.3B
- **Temperature:** 5.0
- **Lambda (loss weighting):** 0.5
- **Training epochs:** 1 (proof of concept)
- **Training data:** 316,110 English-Khmer pairs (generated via the DeepSeek API)
- **Hardware:** NVIDIA A100-SXM4-80GB
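The training code is not published with this card, but the loss implied by the settings above can be sketched. The snippet below is a minimal, illustrative PyTorch implementation of temperature-scaled knowledge distillation, assuming standard soft-target distillation in which lambda weights the hard-label cross-entropy term; the function name and the interpretation of lambda are assumptions, not the author's actual code.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=5.0, lam=0.5):
    """Sketch: lam * CE(student, labels) + (1 - lam) * T^2 * KL(student_T || teacher_T)."""
    # Hard-label cross-entropy against the reference tokens
    # (-100 marks padding positions, following the Hugging Face convention)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Soft-label term: KL divergence between temperature-softened distributions.
    # The T**2 factor restores gradient magnitudes scaled down by the temperature.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return lam * ce + (1.0 - lam) * kd
```

In a full training loop, the teacher logits would come from NLLB-200-1.3B run in inference mode over the same batch, and padding positions would typically be masked out of the KL term as well.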
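Similarly, the ~350M parameter figure under Architecture can be verified directly from the published checkpoint (assumes the `transformers` library and access to the Hugging Face Hub):

```python
from transformers import AutoModelForSeq2SeqLM

# Load the distilled checkpoint and count its parameters (expect roughly 350M)
model = AutoModelForSeq2SeqLM.from_pretrained("lyfeyvutha/nllb_350M_en_km_v1")
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")
```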
## Intended Uses

### Direct Use

This model is intended for:

- English-to-Khmer translation tasks
- Research on knowledge distillation for low-resource languages
- Proof-of-concept demonstrations
- Computational efficiency research

### Downstream Use

- Integration into translation applications
- Fine-tuning for domain-specific translation
- Baseline for further model compression research

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v1",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration: force Khmer as the target language
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
# Decode the first (and only) generated sequence
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```

## Training Details

### Training Data

- **Dataset size:** 316,110 English-Khmer sentence pairs
- **Data source:** Synthetic data generated using the DeepSeek translation API
- **Preprocessing:** Tokenized with the NLLB-200 tokenizer at a maximum length of 128

### Training Hyperparameters

- **Batch size:** 48
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
- **LR scheduler:** Cosine
- **Training epochs:** 1
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8

## Evaluation

### Testing Data

The model was evaluated on the Asian Language Treebank (ALT) corpus, which contains manually translated English-Khmer pairs.

### Metrics

| Metric | Value |
|--------|-------|
| chrF | 21.3502 |
| BERTScore F1 | 0.8983 |

### Results

This proof-of-concept model demonstrates that knowledge distillation can achieve reasonable translation quality with 42% fewer parameters than the 600M baseline, even after a single epoch of training.

## Limitations and Bias

### Limitations

- **Limited training:** Only 1 epoch of training; performance may improve with extended training
- **Synthetic data:** Training data generated via an API may not capture all linguistic nuances
- **Domain specificity:** Performance may vary across text domains
- **Resource constraints:** Optimized for efficiency over maximum quality

### Bias Considerations

- Training data generated via a translation API may inherit biases from the source model
- Limited evaluation on diverse Khmer dialects and registers
- Potential cultural and contextual biases in translation choices

## Citation

```bibtex
@misc{nllb350m_en_km_v1_2025,
  title={NLLB-350M-EN-KM-v1: Proof of Concept English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v1}
}
```

## Model Card Contact

For questions or feedback about this model card: lyfeytech@gmail.com