|
---
language:
- en
- km
license: cc-by-nc-4.0
base_model: facebook/nllb-200-distilled-600M
tags:
- translation
- knowledge-distillation
- nllb
- english
- khmer
- seq2seq
datasets:
- mutiyama/alt
metrics:
- chrf
- bertscore
model-index:
- name: nllb_350M_en_km_v1
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Asian Language Treebank (ALT)
      type: mutiyama/alt
    metrics:
    - type: chrf
      value: 21.3502
    - type: bertscore
      value: 0.8983
pipeline_tag: translation
new_version: lyfeyvutha/nllb_350M_en_km_v10
---
|
|
|
# NLLB-350M-EN-KM-v1 |
|
|
|
## Model Description |
|
|
|
This is a compact English-to-Khmer neural machine translation model created through knowledge distillation from the NLLB-200-1.3B teacher. It is the **proof-of-concept version** (1 training epoch), demonstrating the feasibility of the distillation approach.
|
|
|
- **Developed by:** Chealyfey Vutha |
|
- **Model type:** Sequence-to-sequence transformer for machine translation |
|
- **Language(s):** English to Khmer (en → km) |
|
- **License:** CC-BY-NC 4.0 |
|
- **Base model:** facebook/nllb-200-distilled-600M |
|
- **Teacher model:** facebook/nllb-200-1.3B |
|
- **Parameters:** 350M (a 42% reduction from the 600M baseline)
|
|
|
## Model Details |
|
|
|
### Architecture |
|
- **Encoder layers:** 3 (reduced from 12) |
|
- **Decoder layers:** 3 (reduced from 12) |
|
- **Hidden size:** 1024 |
|
- **Attention heads:** 16 |
|
- **Total parameters:** ~350M |
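
The card does not describe how the student weights were initialized, but a minimal sketch of instantiating a 3-layer-encoder/3-layer-decoder student from the baseline configuration could look like this (random initialization shown for illustration; in practice the student is typically seeded with selected baseline or teacher layers):

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Start from the 600M baseline configuration and reduce the depth.
config = AutoConfig.from_pretrained("facebook/nllb-200-distilled-600M")
config.encoder_layers = 3
config.decoder_layers = 3

# Randomly initialized student with the reduced depth (hidden size 1024 and
# 16 attention heads are inherited from the baseline configuration).
student = AutoModelForSeq2SeqLM.from_config(config)
print(f"{sum(p.numel() for p in student.parameters()) / 1e6:.0f}M parameters")
```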
|
|
|
### Training Procedure |
|
- **Distillation method:** Temperature-scaled knowledge distillation (see the loss sketch after this list)
|
- **Teacher model:** NLLB-200-1.3B |
|
- **Temperature:** 5.0 |
|
- **Lambda (loss weighting):** 0.5 |
|
- **Training epochs:** 1 (proof of concept) |
|
- **Training data:** 316,110 English-Khmer pairs (generated via DeepSeek API) |
|
- **Hardware:** NVIDIA A100-SXM4-80GB |
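
A minimal sketch of the distillation objective described above, assuming a standard combination of hard-label cross-entropy and temperature-scaled KL divergence to the teacher (padding handling and reduction details are assumptions, not taken from the actual training code):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=5.0, lam=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    vocab_size = student_logits.size(-1)

    # Cross-entropy against the reference translations (padding labeled -100).
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
    )

    # KL divergence between temperature-softened student and teacher distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # With lambda = 0.5 the two terms are weighted equally, as on this card.
    return lam * ce + (1.0 - lam) * kd
```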
|
|
|
## Intended Uses |
|
|
|
### Direct Use |
|
This model is intended for: |
|
- English-to-Khmer translation tasks |
|
- Research on knowledge distillation for low-resource languages |
|
- Proof-of-concept demonstrations |
|
- Computational efficiency research |
|
|
|
### Downstream Use |
|
- Integration into translation applications |
|
- Fine-tuning for domain-specific translation |
|
- Baseline for further model compression research |
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v1",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration: force Khmer as the target language
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- **Dataset size:** 316,110 English-Khmer sentence pairs |
|
- **Data source:** Synthetic parallel data generated with the DeepSeek API
|
- **Preprocessing:** Tokenized with the NLLB-200 tokenizer at a maximum length of 128 (see the sketch below)
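
A minimal sketch of this preprocessing step, assuming the standard `text_target` pattern for preparing seq2seq labels (the `en`/`km` field names are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="khm_Khmr"
)

def preprocess(example):
    # Tokenize the English source and the Khmer target in one call;
    # the target ids are returned under the `labels` key.
    return tokenizer(
        example["en"],
        text_target=example["km"],
        max_length=128,
        truncation=True,
    )

print(preprocess({"en": "Hello, how are you?", "km": "សួស្តី តើអ្នកសុខសប្បាយទេ?"}))
```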
|
|
|
### Training Hyperparameters |
|
- **Batch size:** 48 |
|
- **Learning rate:** 3e-5 |
|
- **Optimizer:** AdamW |
|
- **LR scheduler:** Cosine |
|
- **Training epochs:** 1 |
|
- **Hardware:** NVIDIA A100-SXM4-80GB with CUDA 12.8 |
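
A minimal sketch of the optimizer and schedule listed above, using the standard `transformers` cosine scheduler helper (the warmup step count is illustrative and not stated on this card):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, get_cosine_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("lyfeyvutha/nllb_350M_en_km_v1")

# One epoch over 316,110 pairs at batch size 48.
num_training_steps = 316_110 // 48
num_warmup_steps = 500  # illustrative assumption

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)
```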
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs. |
|
|
|
### Metrics |
|
| Metric | Value |
|--------|-------|
| chrF Score | 21.3502 |
| BERTScore F1 | 0.8983 |
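
For context, a minimal sketch of computing these metrics with the `evaluate` library (the BERTScore backbone is an assumption; the card does not state which model was used for scoring):

```python
import evaluate

chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

predictions = ["<model translation 1>", "<model translation 2>"]  # Khmer hypotheses
references = ["<ALT reference 1>", "<ALT reference 2>"]           # Khmer references

chrf_score = chrf.compute(predictions=predictions, references=[[r] for r in references])
bert_f1 = bertscore.compute(
    predictions=predictions,
    references=references,
    model_type="bert-base-multilingual-cased",  # assumed backbone
)["f1"]

print(chrf_score["score"], sum(bert_f1) / len(bert_f1))
```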
|
|
|
### Results |
|
This proof-of-concept model demonstrates that knowledge distillation can achieve reasonable translation quality with a significantly smaller parameter count (350M vs. the 600M baseline).
|
|
|
## Limitations and Bias |
|
|
|
### Limitations |
|
- **Limited training:** Only 1 epoch of training; performance may improve with extended training |
|
- **Synthetic data:** Training data generated via API may not capture all linguistic nuances |
|
- **Domain specificity:** Performance may vary across different text domains |
|
- **Resource constraints:** Optimized for efficiency over maximum quality |
|
|
|
### Bias Considerations |
|
- Training data generated via translation API may inherit biases from the source model |
|
- Limited evaluation on diverse Khmer dialects and registers |
|
- Potential cultural and contextual biases in translation choices |
|
|
|
## Citation |
|
|
|
```bibtex
@misc{nllb350m_en_km_v1_2025,
  title={NLLB-350M-EN-KM-v1: Proof of Concept English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v1}
}
```
|
|
|
## Model Card Contact |
|
|
|
For questions or feedback about this model card: [email protected] |