# Masoretic Hebrew (with nikkud) to Targumic Aramaic (with nikkud) MarianMT Model

This model fine-tunes Helsinki-NLP/opus-mt-sem-sem for translation from the vocalized Masoretic Hebrew Tanakh (with nikkud, no cantillation) to vocalized Targumic Aramaic (Onqelos for the Torah, Jonathan for the Prophets). It is trained on a verse-aligned parallel corpus, with both source and target in Hebrew script with nikkud.
## Model Details

- Model Name: `hebrew_aramaic_model_improved` (local directory)
- Base Model: Helsinki-NLP/opus-mt-sem-sem
- Language Pair: Masoretic Hebrew (with nikkud) → Targumic Aramaic (with nikkud)
- Script: Both languages in Hebrew characters (vocalized, no cantillation)
- Domain: Biblical texts (Tanakh, Torah, Prophets)
- License: MIT
## Dataset
- Hebrew Source: Vocalized Masoretic Hebrew Tanakh (with nikkud, no cantillation)
- Aramaic Target: Vocalized Targumic Aramaic (Onqelos for Torah, Jonathan for Prophets)
- Alignment: Verse-aligned, covering the entire Tanakh
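The corpus files themselves are not documented here; the sketch below assumes a simple tab-separated file with one Hebrew/Aramaic verse pair per line (the file name and layout are illustrative, not part of this repository) and shows how such a corpus could be loaded into a Hugging Face `Dataset` with a held-out split.

```python
from datasets import Dataset

# Assumed layout: one verse pair per line, Hebrew and Aramaic separated by a tab.
pairs = []
with open("tanakh_targum_pairs.tsv", encoding="utf-8") as f:
    for line in f:
        hebrew, aramaic = line.rstrip("\n").split("\t")
        pairs.append({"hebrew": hebrew, "aramaic": aramaic})

# Verse-aligned dataset with a small held-out split for evaluation.
dataset = Dataset.from_list(pairs).train_test_split(test_size=0.1, seed=42)
print(dataset)
```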
## Training Configuration
- Base Model: Helsinki-NLP/opus-mt-sem-sem
- Batch Size: 8 (per device, gradient accumulation 4)
- Learning Rate: 1e-5
- Epochs: 100
- FP16: Enabled
- Language Prefix: `>>heb<<` for Hebrew and `>>arc<<` for Aramaic
- Tokenizer: MarianMT tokenizer with added special tokens for language direction
- Max Input/Target Length: 512
- Eval Steps: 500
- Save Steps: 500
- Warmup Steps: 1000
- Seed: 42
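As a rough sketch, these hyperparameters map onto `Seq2SeqTrainingArguments` roughly as follows; the output directory and the strategy/generation flags are assumptions, not taken from the original training script.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model_improved",  # local directory named above
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,
    eval_steps=500,
    save_steps=500,
    warmup_steps=1000,
    seed=42,
    evaluation_strategy="steps",  # assumed, so that eval_steps takes effect (`eval_strategy` in newer transformers)
    predict_with_generate=True,   # assumed, for BLEU evaluation during training
)
```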
## Performance Metrics
- Final Training Loss: 0.8182
- Test Loss: 0.5223
- BLEU Score: 36.99
- Character Accuracy: 27.05%
- Vocabulary Size: 33,701
- Model Parameters: 61,917,696
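For reference, a minimal sketch of how a BLEU score like the one above can be computed with the `evaluate` library; whether the original evaluation used exactly this setup is not specified, and the predictions and references below are placeholders for decoded test outputs and the reference Targum verses.

```python
import evaluate

bleu = evaluate.load("sacrebleu")

# Placeholders: decoded model outputs and their reference Targum verses.
predictions = ["decoded Aramaic hypothesis for verse 1"]
references = [["reference Aramaic verse 1"]]  # sacreBLEU expects a list of references per prediction

result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.2f}")
```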
## Usage

### Inference

Single text translation:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model_improved \
    --text "בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ" \
    --direction he2arc
```
Using Hugging Face Transformers (local model):
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "./hebrew_aramaic_model_improved"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Translate Hebrew to Aramaic
text = "בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ"
inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Aramaic: {translation}")
```
Batch translation from file:
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model_improved \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```
Interactive mode:
```bash
python inference.py --model_path ./hebrew_aramaic_model_improved
```
## Model Information

The final trained model is saved in `./hebrew_aramaic_model_improved/` with:

- `model.safetensors`: Model weights
- `tokenizer_config.json`: Tokenizer configuration
- `model_info.json`: Training information and metadata
- `training_args.bin`: Training arguments
- `test_results.json`: Final evaluation results
- `all_results.json`: Complete training history
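The JSON files can be inspected directly; a small sketch is shown below (the exact keys inside each file are not documented here, so it simply prints whatever they contain).

```python
import json
from pathlib import Path

model_dir = Path("./hebrew_aramaic_model_improved")

# Final evaluation results and training metadata, as saved by the training run.
with open(model_dir / "test_results.json", encoding="utf-8") as f:
    print(json.load(f))
with open(model_dir / "model_info.json", encoding="utf-8") as f:
    print(json.load(f))
```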
## Language Tokens
The model uses the following language tokens:
- `>>heb<<`: Hebrew (source language)
- `>>arc<<`: Aramaic (target language)
These tokens are added to the tokenizer during training and used during inference to specify the translation direction.
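A sketch of how such direction tokens can be registered on the base tokenizer before fine-tuning (this mirrors the description above but may differ from the exact training script); if any token is actually new, the model's embedding matrix must be resized to match.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")

# Register the direction tokens; add_tokens returns how many were actually new.
num_added = tokenizer.add_tokens([">>heb<<", ">>arc<<"], special_tokens=True)
if num_added:
    model.resize_token_embeddings(len(tokenizer))
```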
## Requirements
- Python 3.8+
- PyTorch
- Transformers
- Datasets
- Evaluate (for BLEU calculation)
- CUDA-compatible GPU (recommended)
## Notes
- The model is specifically optimized for Masoretic Hebrew (with nikkud, no cantillation) to Targumic Aramaic (with nikkud) translation
- Both source and target are in Hebrew script (vocalized, no cantillation)
- The base model (sem-sem) supports multiple Semitic languages
- Training logs are saved to `he2arc_training.log`
- The model shows significant improvement with continued training (100 epochs total)
- Best performance achieved with learning rate reduction in continued training phase
## Troubleshooting

- Out of Memory: Reduce batch size or gradient accumulation steps
- Poor Performance:
  - Check dataset quality
  - Consider continued training with a lower learning rate
  - Ensure proper language token usage (`>>heb<<` and `>>arc<<`)
- Language Token Issues: Ensure the `>>arc<<` token is properly added to the tokenizer
- Training Loss Not Decreasing: Try continued training with a reduced learning rate (5e-6), as sketched below
- Mixed Language Output: This indicates under-training; continue training for more epochs
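A minimal sketch of that continued-training step, resuming from the saved model with the reduced learning rate; the output directory, epoch count, and the tokenized datasets passed in are assumptions or placeholders from your own preprocessing pipeline.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

def continue_training(tokenized_train, tokenized_eval,
                      model_path="./hebrew_aramaic_model_improved"):
    """Resume fine-tuning from the saved model with a reduced learning rate."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

    args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_continued",  # hypothetical directory
        learning_rate=5e-6,             # reduced LR suggested above
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        num_train_epochs=10,            # illustrative number of extra epochs
        fp16=True,
        predict_with_generate=True,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    trainer.save_model(args.output_dir)
```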
## Citation

If you use this model, please cite:

```bibtex
@misc{marianmt-he2arc-targum-voc,
  author = {John Locke Jr.},
  title = {Masoretic Hebrew to Targumic Aramaic (Onqelos & Jonathan) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2arc-targum-voc}},
}
```
Base Model: Helsinki-NLP/opus-mt-sem-sem
Dataset: Parallel corpus of vocalized Masoretic Hebrew Tanakh and vocalized Targumic Aramaic (Onqelos for Torah, Jonathan for Prophets)