Hebrew-English Translation Model

A fine-tuned MarianMT model for translating between Hebrew and English, specifically trained on biblical text from the New World Translation of the Holy Scriptures.

Model Description

  • Model type: MarianMT (Seq2Seq)
  • Language: Hebrew ↔ English
  • Base model: Helsinki-NLP/opus-mt-mul-en
  • Training data: New World Translation of the Holy Scriptures (Modern Hebrew translation)
  • BLEU Score: 46.69 (test set)
  • Character Accuracy: 26.57%

Dataset Information

The model was trained on the New World Translation of the Holy Scriptures dataset, which contains:

  • Source: Modern Hebrew translation (not the original Biblical Hebrew)
  • Target: English translation
  • Dataset size: 30,693 training examples, 3,837 validation examples, 3,837 test examples
  • Text type: Biblical scripture with religious terminology
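The reported split sizes add up to 38,367 sentence pairs. The card does not say how the split was drawn; the following is a minimal sketch of one 80/10/10 scheme that happens to reproduce those counts. The function name `split_80_10_10`, the seed, and the use of integer indices in place of real sentence pairs are illustrative assumptions.

```python
import random

def split_80_10_10(items, seed=42):
    """Shuffle items and cut them into 80% train, 10% validation, 10% test.

    Hypothetical helper: the seed and rounding rule are assumptions,
    not the procedure used to build this model's dataset.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10        # floor of 80%
    n_val = (len(items) - n_train) // 2   # half of the remainder
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]
    return train_set, val_set, test_set

# Integer indices stand in for the 38,367 sentence pairs.
train_set, val_set, test_set = split_80_10_10(range(30_693 + 3_837 + 3_837))
sizes = (len(train_set), len(val_set), len(test_set))
print(sizes)
```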

Training Details

  • Training epochs: 5.34 (early stopping)
  • Learning rate: 2e-5
  • Batch size: 8
  • Mixed precision: FP16
  • Early stopping: Enabled with patience=3
  • Training time: ~3.5 hours
  • Hardware: GPU training
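As a rough cross-check of these figures, the number of optimizer steps can be estimated from the dataset size and batch size. This sketch assumes a single GPU, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these are stated in the card.

```python
import math

train_examples = 30_693  # training set size reported in Dataset Information
batch_size = 8           # batch size listed above
epochs_run = 5.34        # epochs completed when early stopping triggered
eval_every = 500         # evaluation interval from the training configuration

# Assumption: one device and no gradient accumulation, so one batch = one step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = round(steps_per_epoch * epochs_run)
evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, evaluations)
```

Under those assumptions the run covers roughly 20,500 steps and about 40 evaluation rounds before stopping.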

Usage

Using the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/marianmt-he2en-nwt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hebrew to English translation
hebrew_text = "שלום עולם"
inputs = tokenizer(hebrew_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(english_translation)

Using the Pipeline

from transformers import pipeline

translator = pipeline("translation", model="johnlockejrr/marianmt-he2en-nwt")

# Hebrew to English
hebrew_text = "בראשית ברא אלהים את השמים ואת הארץ"
result = translator(hebrew_text)
print(result[0]['translation_text'])

Interactive Translation

python inference.py --model_path ./hebrew_english_model --text "שלום עולם" --direction he2en

Model Performance

Evaluation Metrics

  • BLEU Score: 46.69 (test set)
  • Character Accuracy: 26.57%
  • Training Loss: 1.07
  • Validation Loss: 1.11
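The card does not define how Character Accuracy is computed. One possible reading, shown here purely as an assumption, is a position-by-position character match between the predicted and reference strings. Under that definition a single word of different length shifts every later character, which could yield a low score even when the word-level overlap measured by BLEU is high. The function name `char_accuracy` is hypothetical.

```python
def char_accuracy(prediction: str, reference: str) -> float:
    """Fraction of character positions where prediction and reference agree.

    Assumed definition for illustration only; the metric actually used
    for this model is not documented.
    """
    if not prediction and not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / max(len(prediction), len(reference))

reference = "In the beginning God created"
exact = char_accuracy("In the beginning God created", reference)
shifted = char_accuracy("In the start God created", reference)
print(exact, shifted)
```

Here the second prediction shares four of its five words with the reference, yet most characters after the substituted word no longer line up.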

Translation Examples

| Hebrew | English Translation |
|---|---|
| שלום עולם | Hello world |
| בראשית ברא אלהים | In the beginning God created |
| אהבה | Love |

Limitations

  1. Domain Specificity: This model is trained only on biblical text, so it is likely to perform best on religious or scriptural content
  2. Modern Hebrew: The Hebrew source text is a Modern Hebrew translation, not the original Biblical Hebrew
  3. Context Sensitivity: Translation quality may vary depending on the context and complexity of the text
  4. Cultural Nuances: Some cultural and religious nuances may not be perfectly captured

Training Configuration

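from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_english_model",
    eval_strategy="steps",  # called evaluation_strategy in older transformers releases
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=20,  # upper bound; early stopping ended the run at about 5.34 epochs
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
    greater_is_better=True,
)

# Early stopping is configured through a callback rather than a training argument.
# Pass it to the trainer, e.g. Seq2SeqTrainer(..., callbacks=[early_stopping]).
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)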
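To make the patience setting concrete, the following sketch reproduces the usual patience rule in plain Python: training stops once the monitored metric has failed to improve for three consecutive evaluations. The function name `stopping_eval` and the BLEU values are hypothetical; they are not taken from this model's training log.

```python
def stopping_eval(scores, patience=3):
    """Return the index of the evaluation at which training would stop.

    Returns None if the patience limit is never reached. Higher scores
    are treated as better, matching greater_is_better=True.
    """
    best = float("-inf")
    rounds_without_improvement = 0
    for index, score in enumerate(scores):
        if score > best:
            best = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return index
    return None

# Hypothetical BLEU scores, one per evaluation (every 500 steps).
bleu_history = [40.1, 44.0, 46.7, 46.2, 46.5, 46.6, 47.0]
stop_at = stopping_eval(bleu_history, patience=3)
print(stop_at)
```

In this made-up history the best score occurs at the third evaluation, the next three do not beat it, and training stops before the final value is ever reached. With `load_best_model_at_end=True`, the checkpoint from the best evaluation is the one that is kept.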

Dataset Preparation

The dataset was prepared from the New World Translation corpus with the following preprocessing:

  • Text cleaning and normalization
  • Length filtering (5-1000 characters)
  • Length ratio filtering (0.3-3.0)
  • Train/validation/test split (80/10/10)
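The filtering steps above can be sketched as follows. The card gives only the thresholds, so several details here are assumptions: that the character limits apply to both sides of each pair, that the ratio is source length divided by target length, and that normalization means Unicode NFC plus whitespace collapsing. The helper names and sample pairs are illustrative.

```python
import unicodedata

MIN_CHARS, MAX_CHARS = 5, 1000   # length filter from the list above
MIN_RATIO, MAX_RATIO = 0.3, 3.0  # length ratio filter from the list above

def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace (an assumed cleaning step)."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def keep_pair(src: str, tgt: str) -> bool:
    """Apply the length and length-ratio filters to one sentence pair."""
    for side in (src, tgt):
        if not MIN_CHARS <= len(side) <= MAX_CHARS:
            return False
    # Assumption: ratio is measured as source characters per target character.
    ratio = len(src) / len(tgt)
    return MIN_RATIO <= ratio <= MAX_RATIO

# Illustrative pairs, not drawn from the actual training data.
pairs = [
    ("בראשית ברא אלהים את השמים ואת הארץ",
     "In the beginning God created the heavens and the earth."),
    ("אמן", "Amen"),                        # dropped: shorter than 5 characters
    ("שלום עולם", "Hello world, " * 10),    # dropped: ratio below 0.3
]

cleaned = [(normalize(s), normalize(t)) for s, t in pairs]
kept = [pair for pair in cleaned if keep_pair(*pair)]
print(len(kept))
```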

Citation

If you use this model in your research, please cite:

@misc{hebrew_english_translation_2025,
  title={Hebrew-English Translation Model},
  author={johnlockejrr},
  year={2025},
  url={https://huggingface.co/johnlockejrr/marianmt-he2en-nwt}
}

License

This model is released under the same license as the base model (MarianMT) and the training dataset.

Acknowledgments

  • Base model: Helsinki-NLP/opus-mt-mul-en
  • Dataset: New World Translation of the Holy Scriptures
  • Training framework: Hugging Face Transformers

Contact

For questions or issues, please open an issue on the Hugging Face model page.


Note: This model is specifically designed for biblical text translation and may not perform optimally on general Hebrew-English translation tasks.
