Masoretic Hebrew (with nikkud) to Targumic Aramaic (with nikkud) MarianMT Model

This model is a fine-tune of Helsinki-NLP/opus-mt-sem-sem for translation from the vocalized Masoretic Hebrew Tanakh (with nikkud, no cantillation) to vocalized Targumic Aramaic (Targum Onqelos for the Torah, Targum Jonathan for the Prophets). It was trained on a verse-aligned parallel corpus, with both source and target in Hebrew script with nikkud.

Model Details

  • Model Name: hebrew_aramaic_model_improved (local directory)
  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Language Pair: Masoretic Hebrew (with nikkud) → Targumic Aramaic (with nikkud)
  • Script: Both languages in Hebrew characters (vocalized, no cantillation)
  • Domain: Biblical texts (Tanakh, Torah, Prophets)
  • License: MIT

Dataset

  • Hebrew Source: Vocalized Masoretic Hebrew Tanakh (with nikkud, no cantillation)
  • Aramaic Target: Vocalized Targumic Aramaic (Onqelos for Torah, Jonathan for Prophets)
  • Alignment: Verse-aligned, covering the entire Tanakh
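
The corpus file format and location are not documented here. As an illustration only, a verse-aligned corpus stored as a tab-separated file (hypothetical name tanakh_targum.tsv, with hebrew and aramaic columns) could be loaded with the datasets library like this:

    from datasets import load_dataset

    # Hypothetical layout: one verse pair per line, tab-separated,
    # with a header row "hebrew<TAB>aramaic".
    dataset = load_dataset(
        "csv",
        data_files={"train": "tanakh_targum.tsv"},
        delimiter="\t",
    )
    print(dataset["train"][0])  # {'hebrew': '...', 'aramaic': '...'}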

Training Configuration

  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Batch Size: 8 per device, with gradient accumulation steps of 4 (effective batch size 32 per device)
  • Learning Rate: 1e-5
  • Epochs: 100
  • FP16: Enabled
  • Language Prefix: Uses >>heb<< for Hebrew and >>arc<< for Aramaic
  • Tokenizer: MarianMT tokenizer with added special tokens for language direction
  • Max Input/Target Length: 512
  • Eval Steps: 500
  • Save Steps: 500
  • Warmup Steps: 1000
  • Seed: 42
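
The hyperparameters above map roughly onto Hugging Face Seq2SeqTrainingArguments as sketched below. The output directory and the exact argument set are assumptions; the actual training script is not reproduced here.

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_improved",  # assumed output path
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=1e-5,
        num_train_epochs=100,
        fp16=True,
        evaluation_strategy="steps",  # "eval_strategy" in newer transformers releases
        eval_steps=500,
        save_steps=500,
        warmup_steps=1000,
        seed=42,
        predict_with_generate=True,
    )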

Performance Metrics

  • Final Training Loss: 0.8182
  • Test Loss: 0.5223
  • BLEU Score: 36.99
  • Character Accuracy: 27.05%
  • Vocabulary Size: 33,701
  • Model Parameters: 61,917,696
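
The metrics above come from the training run itself. As a rough sketch, the BLEU score on a held-out set can be recomputed with the evaluate library (requires the sacrebleu package; the texts below are placeholders):

    import evaluate

    bleu = evaluate.load("sacrebleu")
    predictions = ["model output for a verse ..."]            # decoded model translations
    references = [["gold Targum text for the same verse ..."]]  # one list of references per prediction
    result = bleu.compute(predictions=predictions, references=references)
    print(result["score"])  # corpus-level BLEU (36.99 reported above)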

Usage

Inference

  1. Single text translation:

    python inference.py \
        --model_path ./hebrew_aramaic_model_improved \
        --text "בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ" \
        --direction he2arc
    
  2. Using Hugging Face Transformers (local model):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_path = "./hebrew_aramaic_model_improved"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
    
    # Translate Hebrew to Aramaic
    text = "בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ"
    inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Aramaic: {translation}")
    
  3. Batch translation from file:

    python inference.py \
        --model_path ./hebrew_aramaic_model_improved \
        --input_file input_texts.txt \
        --output_file translations.txt \
        --direction he2arc
    
  4. Interactive mode:

    python inference.py --model_path ./hebrew_aramaic_model_improved
    

Model Information

The final trained model is saved in ./hebrew_aramaic_model_improved/ with:

  • model.safetensors: Model weights
  • tokenizer_config.json: Tokenizer configuration
  • model_info.json: Training information and metadata
  • training_args.bin: Training arguments
  • test_results.json: Final evaluation results
  • all_results.json: Complete training history
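
The internal schema of these JSON files is not documented here; a minimal sketch for inspecting them:

    import json
    from pathlib import Path

    model_dir = Path("./hebrew_aramaic_model_improved")
    for name in ("model_info.json", "test_results.json", "all_results.json"):
        path = model_dir / name
        if path.exists():
            # Just dump whatever fields the training script stored.
            print(name, json.loads(path.read_text(encoding="utf-8")))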

Language Tokens

The model uses the following language tokens:

  • >>heb<<: Hebrew (source language)
  • >>arc<<: Aramaic (target language)

These tokens are added to the tokenizer during training and used during inference to specify the translation direction.
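
A quick way to confirm both direction tokens are present in the tokenizer vocabulary (a sketch; see also Troubleshooting below):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("./hebrew_aramaic_model_improved")
    for token in (">>heb<<", ">>arc<<"):
        token_id = tokenizer.convert_tokens_to_ids(token)
        # A missing token falls back to the unknown-token id.
        print(token, token_id, token_id != tokenizer.unk_token_id)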

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers
  • Datasets
  • Evaluate (for BLEU calculation)
  • CUDA-compatible GPU (recommended)
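
A quick sanity check that the environment is set up and a GPU is visible (a sketch):

    import torch
    import transformers

    print("torch:", torch.__version__)
    print("transformers:", transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())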

Notes

  • The model is specifically optimized for Masoretic Hebrew (with nikkud, no cantillation) to Targumic Aramaic (with nikkud) translation
  • Both source and target are in Hebrew script (vocalized, no cantillation)
  • The base model (sem-sem) supports multiple Semitic languages
  • Training logs are saved to he2arc_training.log
  • Continued training (to 100 epochs in total) significantly improved results over the initial run
  • The best results were obtained after reducing the learning rate in the continued-training phase

Troubleshooting

  1. Out of Memory: Reduce batch size or gradient accumulation steps
  2. Poor Performance:
    • Check dataset quality
    • Consider continued training with lower learning rate
    • Ensure proper language token usage (>>heb<< and >>arc<<)
  3. Language Token Issues: Ensure >>arc<< token is properly added to tokenizer
  4. Training Loss Not Decreasing: Try continued training with a reduced learning rate (5e-6); see the sketch after this list
  5. Mixed Language Output: This indicates under-training; continue training for more epochs
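
For items 2 and 4, a sketch of a continued-training setup that reloads the saved weights with a lowered learning rate. The output directory and epoch count here are illustrative assumptions, not the values used for this model:

    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        Seq2SeqTrainingArguments,
    )

    # Resume from the fine-tuned weights rather than the base model.
    model_path = "./hebrew_aramaic_model_improved"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

    continued_args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_continued",  # hypothetical output dir
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,        # reduced learning rate suggested above
        num_train_epochs=20,       # illustrative; choose based on validation loss
        fp16=True,
    )
    # Build a Seq2SeqTrainer with these arguments and the tokenized dataset as usual.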

Citation

If you use this model, please cite:

@misc{marianmt-he2arc-targum-voc,
  author = {John Locke Jr.},
  title = {Masoretic Hebrew to Targumic Aramaic (Onqelos & Jonathan) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2arc-targum-voc}},
}

Base Model: Helsinki-NLP/opus-mt-sem-sem
Dataset: Parallel corpus of vocalized Masoretic Hebrew Tanakh and vocalized Targumic Aramaic (Onqelos for Torah, Jonathan for Prophets)
