Masoretic Hebrew (with nikkud) to Targumic Aramaic (with nikkud) MarianMT Model

This model is a fine-tune of Helsinki-NLP/opus-mt-sem-sem for translation from the vocalized Masoretic Hebrew Tanakh (with nikkud, no cantillation) to vocalized Targumic Aramaic (Targum Onqelos for the Torah, Targum Jonathan for the Prophets). It was trained on a verse-aligned parallel corpus, with both source and target in Hebrew script with nikkud.

Model Details

  • Model Name: hebrew_aramaic_model_improved (local directory)
  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Language Pair: Masoretic Hebrew (with nikkud) → Targumic Aramaic (with nikkud)
  • Script: Both languages in Hebrew characters (vocalized, no cantillation)
  • Domain: Biblical texts (Tanakh, Torah, Prophets)
  • License: MIT

Dataset

  • Hebrew Source: Vocalized Masoretic Hebrew Tanakh (with nikkud, no cantillation)
  • Aramaic Target: Vocalized Targumic Aramaic (Onqelos for Torah, Jonathan for Prophets)
  • Alignment: Verse-aligned, covering the entire Tanakh
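
The corpus file format and location are not documented here. As an illustration only, a verse-aligned corpus stored as a tab-separated file (hypothetical name tanakh_targum.tsv, with hebrew and aramaic columns) could be loaded with the datasets library like this:

    from datasets import load_dataset

    # Hypothetical layout: one verse pair per line, tab-separated,
    # with a header row "hebrew<TAB>aramaic".
    dataset = load_dataset(
        "csv",
        data_files={"train": "tanakh_targum.tsv"},
        delimiter="\t",
    )
    print(dataset["train"][0])  # {'hebrew': '...', 'aramaic': '...'}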

Training Configuration

  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Batch Size: 8 per device, with gradient accumulation steps of 4 (effective batch size 32 per device)
  • Learning Rate: 1e-5
  • Epochs: 100
  • FP16: Enabled
  • Language Prefix: Uses >>heb<< for Hebrew and >>arc<< for Aramaic
  • Tokenizer: MarianMT tokenizer with added special tokens for language direction
  • Max Input/Target Length: 512
  • Eval Steps: 500
  • Save Steps: 500
  • Warmup Steps: 1000
  • Seed: 42
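
The hyperparameters above map roughly onto Hugging Face Seq2SeqTrainingArguments as sketched below. The output directory and the exact argument set are assumptions; the actual training script is not reproduced here.

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_improved",  # assumed output path
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=1e-5,
        num_train_epochs=100,
        fp16=True,
        evaluation_strategy="steps",  # "eval_strategy" in newer transformers releases
        eval_steps=500,
        save_steps=500,
        warmup_steps=1000,
        seed=42,
        predict_with_generate=True,
    )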

Performance Metrics

  • Final Training Loss: 0.8182
  • Test Loss: 0.5223
  • BLEU Score: 36.99
  • Character Accuracy: 27.05%
  • Vocabulary Size: 33,701
  • Model Parameters: 61,917,696
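
The metrics above come from the training run itself. As a rough sketch, the BLEU score on a held-out set can be recomputed with the evaluate library (requires the sacrebleu package; the texts below are placeholders):

    import evaluate

    bleu = evaluate.load("sacrebleu")
    predictions = ["model output for a verse ..."]            # decoded model translations
    references = [["gold Targum text for the same verse ..."]]  # one list of references per prediction
    result = bleu.compute(predictions=predictions, references=references)
    print(result["score"])  # corpus-level BLEU (36.99 reported above)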

Usage

Inference

  1. Single text translation:

    python inference.py \
        --model_path ./hebrew_aramaic_model_improved \
        --text "בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ" \
        --direction he2arc
    
  2. Using Hugging Face Transformers (local model):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_path = "./hebrew_aramaic_model_improved"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
    
    # Translate Hebrew to Aramaic
    text = "בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ"
    inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Aramaic: {translation}")
    
  3. Batch translation from file:

    python inference.py \
        --model_path ./hebrew_aramaic_model_improved \
        --input_file input_texts.txt \
        --output_file translations.txt \
        --direction he2arc
    
  4. Interactive mode:

    python inference.py --model_path ./hebrew_aramaic_model_improved
    

Model Information

The final trained model is saved in ./hebrew_aramaic_model_improved/ with:

  • model.safetensors: Model weights
  • tokenizer_config.json: Tokenizer configuration
  • model_info.json: Training information and metadata
  • training_args.bin: Training arguments
  • test_results.json: Final evaluation results
  • all_results.json: Complete training history
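
The internal schema of these JSON files is not documented here; a minimal sketch for inspecting them:

    import json
    from pathlib import Path

    model_dir = Path("./hebrew_aramaic_model_improved")
    for name in ("model_info.json", "test_results.json", "all_results.json"):
        path = model_dir / name
        if path.exists():
            # Just dump whatever fields the training script stored.
            print(name, json.loads(path.read_text(encoding="utf-8")))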

Language Tokens

The model uses the following language tokens:

  • >>heb<<: Hebrew (source language)
  • >>arc<<: Aramaic (target language)

These tokens are added to the tokenizer during training and used during inference to specify the translation direction.
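
A quick way to confirm both direction tokens are present in the tokenizer vocabulary (a sketch; see also Troubleshooting below):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("./hebrew_aramaic_model_improved")
    for token in (">>heb<<", ">>arc<<"):
        token_id = tokenizer.convert_tokens_to_ids(token)
        # A missing token falls back to the unknown-token id.
        print(token, token_id, token_id != tokenizer.unk_token_id)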

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers
  • Datasets
  • Evaluate (for BLEU calculation)
  • CUDA-compatible GPU (recommended)
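
A quick sanity check that the environment is set up and a GPU is visible (a sketch):

    import torch
    import transformers

    print("torch:", torch.__version__)
    print("transformers:", transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())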

Notes

  • The model is specifically optimized for Masoretic Hebrew (with nikkud, no cantillation) to Targumic Aramaic (with nikkud) translation
  • Both source and target are in Hebrew script (vocalized, no cantillation)
  • The base model (sem-sem) supports multiple Semitic languages
  • Training logs are saved to he2arc_training.log
  • Continued training (to 100 epochs in total) significantly improved results over the initial run
  • The best results were obtained after reducing the learning rate in the continued-training phase

Troubleshooting

  1. Out of Memory: Reduce batch size or gradient accumulation steps
  2. Poor Performance:
    • Check dataset quality
    • Consider continued training with lower learning rate
    • Ensure proper language token usage (>>heb<< and >>arc<<)
  3. Language Token Issues: Ensure >>arc<< token is properly added to tokenizer
  4. Training Loss Not Decreasing: Try continued training with a reduced learning rate (5e-6); see the sketch after this list
  5. Mixed Language Output: This indicates under-training; continue training for more epochs
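
For items 2 and 4, a sketch of a continued-training setup that reloads the saved weights with a lowered learning rate. The output directory and epoch count here are illustrative assumptions, not the values used for this model:

    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        Seq2SeqTrainingArguments,
    )

    # Resume from the fine-tuned weights rather than the base model.
    model_path = "./hebrew_aramaic_model_improved"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

    continued_args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_continued",  # hypothetical output dir
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,        # reduced learning rate suggested above
        num_train_epochs=20,       # illustrative; choose based on validation loss
        fp16=True,
    )
    # Build a Seq2SeqTrainer with these arguments and the tokenized dataset as usual.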

Citation

If you use this model, please cite:

@misc{marianmt-he2arc-targum-voc,
  author = {John Locke Jr.},
  title = {Masoretic Hebrew to Targumic Aramaic (Onqelos & Jonathan) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2arc-targum-voc}},
}

Base Model: Helsinki-NLP/opus-mt-sem-sem
Dataset: Parallel corpus of vocalized Masoretic Hebrew Tanakh and vocalized Targumic Aramaic (Onqelos for Torah, Jonathan for Prophets)
