Update README.md
README.md CHANGED

```diff
@@ -86,7 +86,7 @@ This model is intended for:
 
 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig
 
-Configuration
+# Configuration
 CONFIG = {
     "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
     "tokenizer_name": "facebook/nllb-200-distilled-600M",
@@ -95,7 +95,7 @@ CONFIG = {
     "max_length": 128
 }
 
-Load model and tokenizer
+# Load model and tokenizer
 model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
 tokenizer = AutoTokenizer.from_pretrained(
     CONFIG["tokenizer_name"],
@@ -103,14 +103,14 @@ src_lang=CONFIG["source_lang"],
     tgt_lang=CONFIG["target_lang"]
 )
 
-Set up generation configuration
+# Set up generation configuration
 khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
 generation_config = GenerationConfig(
     max_length=CONFIG["max_length"],
     forced_bos_token_id=khm_token_id
 )
 
-Translate
+# Translate
 text = "Hello, how are you?"
 inputs = tokenizer(text, return_tensors="pt")
 outputs = model.generate(**inputs, generation_config=generation_config)
@@ -122,6 +122,7 @@ print(translation)
 
 ## Training Details
 
 ### Training Data
+
 - **Dataset size:** 316,110 English-Khmer sentence pairs
 - **Data source:** Synthetic data generated using DeepSeek translation API
 - **Preprocessing:** Tokenized using NLLB-200 tokenizer with max length 128
```
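For reference, here is a minimal sketch of how the snippet reads after this change, assembled into one runnable block. The hunks above elide CONFIG's `source_lang`/`target_lang` values and the decoding line that defines `translation`, so those parts are assumptions here: the standard NLLB-200 language codes `eng_Latn` and `khm_Khmr`, and a `batch_decode` call.

```python
# Minimal end-to-end sketch of the README snippet after this change.
# Assumptions (elided in the diff): the source_lang/target_lang values
# and the decode step that produces `translation`.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",  # assumed: standard NLLB-200 code for English
    "target_lang": "khm_Khmr",  # assumed: standard NLLB-200 code for Khmer
    "max_length": 128,
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"],
)

# Force generation to start with the Khmer language token
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id,
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)

# Assumed decode step (the diff elides the line defining `translation`)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```

Setting `forced_bos_token_id` to the target-language token is how NLLB-200 checkpoints select the output language; without it, generation may start in the wrong language.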
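On the training side, the preprocessing bullet is terse. Below is a hypothetical sketch of what "tokenized using the NLLB-200 tokenizer with max length 128" could look like for one sentence pair; the field names, example Khmer text, and padding strategy are assumptions, not taken from the model card.

```python
# Hypothetical sketch of the stated preprocessing: tokenize an English-Khmer
# pair with the NLLB-200 tokenizer, truncating to 128 tokens. The field
# names ("en", "km"), example text, and padding choice are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # assumed NLLB-200 code
    tgt_lang="khm_Khmr",  # assumed NLLB-200 code
)

pair = {"en": "Hello, how are you?", "km": "សួស្តី តើអ្នកសុខសប្បាយទេ?"}

batch = tokenizer(
    pair["en"],
    text_target=pair["km"],  # target side is tokenized under tgt_lang
    max_length=128,
    truncation=True,
    padding="max_length",
)
# batch["input_ids"] and batch["labels"] are now fixed-length 128 sequences
```

Truncating both sides to 128 tokens matches the `max_length` the snippet above uses at inference time.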