---
license: mit
language:
- he
- en
base_model:
- Helsinki-NLP/opus-mt-mul-en
library_name: transformers
tags:
- Translation
---

# Hebrew-English Translation Model

A fine-tuned MarianMT model for translating Hebrew into English, trained on biblical text from the New World Translation of the Holy Scriptures.

## Model Description

- **Model type:** MarianMT (Seq2Seq)
- **Language:** Hebrew → English
- **Base model:** Helsinki-NLP/opus-mt-mul-en
- **Training data:** New World Translation of the Holy Scriptures (Modern Hebrew translation)
- **BLEU Score:** 46.69 (test set)
- **Character Accuracy:** 26.57%

## Dataset Information

The model was trained on the New World Translation of the Holy Scriptures dataset:

- **Source:** Modern Hebrew translation (not the original Biblical Hebrew)
- **Target:** English translation
- **Dataset size:** 30,693 training examples, 3,837 validation examples, 3,837 test examples
- **Text type:** Biblical scripture with religious terminology

## Training Details

- **Training epochs:** 5.34 (stopped early)
- **Learning rate:** 2e-5
- **Batch size:** 8
- **Mixed precision:** FP16
- **Early stopping:** Enabled with patience=3
- **Training time:** ~3.5 hours
- **Hardware:** GPU

## Usage

### Using the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/marianmt-he2en-nwt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hebrew to English translation
hebrew_text = "שלום עולם"
inputs = tokenizer(hebrew_text, return_tensors="pt", padding=True)

# Beam search with 4 beams, output capped at 128 tokens
outputs = model.generate(**inputs, max_length=128, num_beams=4)
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(english_translation)
```

### Using the Pipeline

```python
from transformers import pipeline

translator = pipeline("translation", model="johnlockejrr/marianmt-he2en-nwt")

# Hebrew to English
hebrew_text = "בראשית ברא אלהים את השמים ואת הארץ"
result = translator(hebrew_text)
print(result[0]["translation_text"])
```

### Interactive Translation

```bash
python inference.py --model_path ./hebrew_english_model --text "שלום עולם" --direction he2en
```

## Model Performance

### Evaluation Metrics

- **BLEU Score:** 46.69 (test set)
- **Character Accuracy:** 26.57%
- **Training Loss:** 1.07
- **Validation Loss:** 1.11

### Translation Examples

| Hebrew | English Translation |
|--------|---------------------|
| שלום עולם | Hello world |
| בראשית ברא אלהים | In the beginning God created |
| אהבה | Love |

## Limitations

1. **Domain Specificity:** The model is trained only on biblical text and is likely to perform best on religious or scriptural content.
2. **Modern Hebrew:** The Hebrew source text is a Modern Hebrew translation, not the original Biblical Hebrew.
3. **Context Sensitivity:** Translation quality may vary with the context and complexity of the input.
4. **Cultural Nuances:** Some cultural and religious nuances may not be captured accurately.

## Training Configuration

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_english_model",
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=20,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
    greater_is_better=True,
)

# early_stopping_patience is not a Seq2SeqTrainingArguments field; it is set
# on EarlyStoppingCallback, which is passed to the trainer via callbacks=[...].
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```

## Dataset Preparation

The dataset was prepared from the New World Translation corpus with the following preprocessing:

- Text cleaning and normalization
- Length filtering (5-1000 characters)
- Length ratio filtering (0.3-3.0)
- Train/validation/test split (80/10/10)

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{hebrew_english_translation_2025,
  title={Hebrew-English Translation Model},
  author={johnlockejrr},
  year={2025},
  url={https://huggingface.co/johnlockejrr/marianmt-he2en-nwt}
}
```

## License

This model is released under the same license as the base model (MarianMT) and the training dataset.

## Acknowledgments

- Base model: Helsinki-NLP/opus-mt-mul-en
- Dataset: New World Translation of the Holy Scriptures
- Training framework: Hugging Face Transformers

## Contact

For questions or issues, please open an issue on the Hugging Face model page.

---

**Note:** This model is designed for biblical text translation and may not perform optimally on general Hebrew-English translation tasks.
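
The length and length-ratio filters and the 80/10/10 split listed under Dataset Preparation can be sketched as follows. This is a minimal illustration, not the original preprocessing script: the function names `keep_pair` and `split_pairs`, the choice to apply the length limits to both sides, the direction of the ratio (Hebrew length divided by English length), and the shuffle seed are all assumptions.

```python
import random


def keep_pair(src: str, tgt: str,
              min_len: int = 5, max_len: int = 1000,
              min_ratio: float = 0.3, max_ratio: float = 3.0) -> bool:
    """Return True if a (Hebrew, English) pair passes both filters.

    Hypothetical helper: lengths are measured in characters and the
    limits are applied to both sides, which is an assumption.
    """
    src, tgt = src.strip(), tgt.strip()
    # Length filter: each side must be 5-1000 characters long.
    if not (min_len <= len(src) <= max_len and min_len <= len(tgt) <= max_len):
        return False
    # Ratio filter: source/target character-length ratio within 0.3-3.0.
    ratio = len(src) / len(tgt)
    return min_ratio <= ratio <= max_ratio


def split_pairs(pairs, seed: int = 42):
    """Shuffle and split pairs into 80/10/10 train/validation/test.

    Hypothetical helper: the seed and the shuffle are assumptions.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = len(pairs) * 8 // 10
    n_val = len(pairs) // 10
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    corpus = [("שלום עולם", "Hello world"), ("אב", "Father")]
    filtered = [p for p in corpus if keep_pair(*p)]
    print(len(filtered), "of", len(corpus), "pairs kept")
```

Integer arithmetic is used for the split sizes so that the counts do not depend on floating-point rounding; any remainder goes to the test set.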