Masoretic Hebrew to Targumic Aramaic (Onqelos & Jonathan) Translation Model

This model is a fine-tuned version of Helsinki-NLP/opus-mt-sem-sem for translating the consonantal Hebrew Masoretic Text (Tanakh) into the consonantal Aramaic Targum of Onqelos (Torah) and Targum of Jonathan (Prophets). It was trained on parallel biblical texts, with both source and target in Hebrew script.

Model Details

  • Model Name: johnlockejrr/marianmt-he2arc-targum
  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Language Pair: Masoretic Hebrew → Targumic Aramaic (he2arc)
  • Script: Both languages in Hebrew characters (consonantal)
  • Domain: Biblical texts (Torah and Prophets)
  • License: MIT

Overview

The model is based on Helsinki-NLP/opus-mt-sem-sem, a multilingual model for translation between Semitic languages. It is fine-tuned for the specific task of translating the consonantal Hebrew Masoretic Text into the consonantal Aramaic Targum of Onqelos (for the Torah) and Targum of Jonathan (for the Prophets), using a custom parallel corpus.

Key Features

  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Direction: Masoretic Hebrew → Targumic Aramaic (he2arc)
  • Script: Both source and target in Hebrew script (consonantal)
  • Language Tokens: Uses >>heb<< for Hebrew and >>arc<< for Aramaic
  • Optimized: Configured for a 12 GB GPU (batch size 8 with gradient accumulation 4; see Training Configuration below)

Dataset

The training data is stored in the hebrew_aramaic_dataset directory; each record contains the following fields (a loading sketch follows the list):

  • hebrew: Consonantal Hebrew Masoretic Text (Tanakh)
  • aramaic: Consonantal Aramaic Targum (Onqelos for Torah, Jonathan for Prophets)
  • Additional metadata: book, chapter, verse
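
A minimal sketch of loading the corpus with the datasets library, assuming the directory holds JSON Lines files with the fields above (the file names and split layout here are assumptions, not the actual layout):

    from datasets import load_dataset

    # Assumed layout: one JSONL file per split inside hebrew_aramaic_dataset/
    dataset = load_dataset(
        "json",
        data_files={
            "train": "hebrew_aramaic_dataset/train.jsonl",
            "test": "hebrew_aramaic_dataset/test.jsonl",
        },
    )

    # Each record is expected to look like:
    # {"hebrew": "...", "aramaic": "...", "book": "Genesis", "chapter": 1, "verse": 1}
    print(dataset["train"][0])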

Training Configuration

  • Batch Size: 8 (optimized for 12GB GPU)
  • Learning Rate: 1e-5
  • Epochs: 100
  • Warmup Steps: 1000
  • Gradient Accumulation: 4 (effective batch size: 32)
  • FP16: Enabled for faster training
  • Language Prefix: Uses >>heb<< and >>arc<< tokens
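
The configuration above maps onto Hugging Face Seq2SeqTrainingArguments roughly as follows; this is a sketch for orientation, not the exact training script (the output directory name is taken from the Usage section):

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_improved",
        per_device_train_batch_size=8,    # fits a 12 GB GPU
        gradient_accumulation_steps=4,    # effective batch size 32
        learning_rate=1e-5,
        num_train_epochs=100,
        warmup_steps=1000,
        fp16=True,                        # mixed precision for faster training
        predict_with_generate=True,       # generate translations during evaluation
    )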

Performance Metrics

  • Final Training Loss: 0.136
  • Test Loss: 1.30
  • BLEU Score: 42.28
  • Character Accuracy: 34.11%
  • Vocabulary Size: 33,701
  • Model Parameters: 61,917,696

Usage

Inference

  1. Single text translation:

    python inference.py \
        --model_path ./hebrew_aramaic_model_improved \
        --text "בראשית ברא אלהים" \
        --direction he2arc
    
  2. Using Hugging Face Hub:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_name = "johnlockejrr/marianmt-he2arc-targum"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
    # Translate Hebrew to Aramaic
    text = "בראשית ברא אלהים"
    inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Translation: {translation}")
    
  3. Batch translation from file:

    python inference.py \
        --model_path ./hebrew_aramaic_model_improved \
        --input_file input_texts.txt \
        --output_file translations.txt \
        --direction he2arc
    
  4. Interactive mode:

    python inference.py --model_path ./hebrew_aramaic_model_improved
    

Model Information

The final trained model is saved in ./hebrew_aramaic_model_improved/ with:

  • model.safetensors: Model weights
  • tokenizer_config.json: Tokenizer configuration
  • model_info.json: Training information and metadata
  • training_args.bin: Training arguments
  • test_results.json: Final evaluation results
  • all_results.json: Complete training history

Language Tokens

The model uses the following language tokens:

  • >>heb<<: Hebrew (source language)
  • >>arc<<: Aramaic (target language)

These tokens are added to the tokenizer during training and used during inference to specify the translation direction.
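
As a minimal sketch (assuming the standard transformers add_tokens API; the actual training script may differ), the >>arc<< token can be added to the base tokenizer and the input prefixed like this:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    base = "Helsinki-NLP/opus-mt-sem-sem"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForSeq2SeqLM.from_pretrained(base)

    # Add the Aramaic language token if the base vocabulary does not have it,
    # then resize the embedding matrix to match the new vocabulary size.
    added = tokenizer.add_tokens([">>arc<<"], special_tokens=True)
    if added:
        model.resize_token_embeddings(len(tokenizer))

    # At inference time the direction is selected by prefixing the source text.
    inputs = tokenizer(">>heb<< בראשית ברא אלהים", return_tensors="pt")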

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers
  • Datasets
  • Evaluate (for BLEU calculation; a sketch follows this list)
  • CUDA-compatible GPU (recommended)
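
As a sketch of how the reported BLEU could be reproduced (the exact metric script used for the 42.28 figure is not documented here; sacrebleu via the evaluate package is assumed):

    import evaluate

    # sacrebleu expects one list of reference strings per prediction.
    bleu = evaluate.load("sacrebleu")
    predictions = ["..."]    # model outputs (Aramaic, Hebrew script)
    references = [["..."]]   # gold Targum verses
    print(bleu.compute(predictions=predictions, references=references)["score"])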

Notes

  • The model is specifically optimized for Masoretic Hebrew to Targumic Aramaic translation
  • Both source and target are in Hebrew script (consonantal)
  • The base model (sem-sem) supports multiple Semitic languages
  • Training logs are saved to he2arc_training.log
  • The model shows significant improvement with continued training (100 epochs total)
  • Best performance achieved with learning rate reduction in continued training phase

Troubleshooting

  1. Out of Memory: Reduce the per-device batch size (increase gradient accumulation steps to keep the effective batch size)
  2. Poor Performance:
    • Check dataset quality
    • Consider continued training with lower learning rate
    • Ensure proper language token usage (>>heb<< and >>arc<<)
  3. Language Token Issues: Ensure >>arc<< token is properly added to tokenizer
  4. Training Loss Not Decreasing: Try continued training with a reduced learning rate (5e-6); see the sketch after this list
  5. Mixed Language Output: This indicates under-training; continue training for more epochs
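
For items 2 and 4, a minimal continued-training sketch (assuming the Hugging Face Seq2SeqTrainer API; the continued-run output directory and epoch count are illustrative):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments

    # Reload the previously trained model and lower the learning rate.
    model_dir = "./hebrew_aramaic_model_improved"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    continued_args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_continued",  # hypothetical directory
        learning_rate=5e-6,               # reduced learning rate
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        num_train_epochs=20,              # illustrative; extend as needed
        fp16=True,
    )
    # Pass model, continued_args, the tokenized dataset, and the tokenizer to
    # Seq2SeqTrainer and call trainer.train() to resume training.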

Citation

If you use this model, please cite:

@misc{marianmt-he2arc-targum,
  author = {John Locke Jr.},
  title = {Masoretic Hebrew to Targumic Aramaic (Onqelos & Jonathan) Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2arc-targum}},
}

Base Model: Helsinki-NLP/opus-mt-sem-sem
Dataset: Parallel corpus of consonantal Hebrew Masoretic Text and consonantal Aramaic Targum of Onqelos (Torah) and Jonathan (Prophets)
