Samaritan Hebrew to Samaritan Targumic Aramaic Translation Model
This model fine-tunes the Helsinki-NLP/opus-mt-sem-sem model for Samaritan Hebrew to Samaritan Targumic Aramaic translation, where Aramaic is written in Hebrew script. The model is specifically trained on biblical texts of the Samaritan Torah extant in the BL Or. 7562 manuscript.
Model Details
- Model Name: johnlockejrr/marianmt-he2arc-sam
- Base Model: Helsinki-NLP/opus-mt-sem-sem
- Language Pair: Samaritan Hebrew → Samaritan Targumic Aramaic (he2arc)
- Script: Aramaic written in Hebrew characters
- Domain: Biblical and religious texts
- License: MIT
Overview
The model is based on Helsinki-NLP/opus-mt-sem-sem, which is designed for translation between Semitic languages. We fine-tune it specifically for Samaritan Hebrew to Samaritan Targumic Aramaic translation using a small custom dataset of biblical texts.
Key Features
- Base Model: Helsinki-NLP/opus-mt-sem-sem
- Direction: Samaritan Hebrew → Samaritan Targumic Aramaic (he2arc)
- Script: Aramaic written in Hebrew characters
- Language Tokens: Uses `>>heb<<` for Hebrew and `>>arc<<` for Aramaic
- Optimized: Configured for a 12GB GPU with appropriate batch sizes and learning rates
Dataset
The training uses the `hebrew_aramaic_dataset` directory, which contains the following fields (a loading sketch follows the list):
- `hebrew`: Hebrew source text
- `aramaic`: Aramaic target text (in Hebrew script)
- Additional metadata: book, chapter, verse
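If the directory was written with `datasets.save_to_disk`, it can be loaded back as sketched below. This is a hedged example: the split name and path are assumptions, not guaranteed by the repository.

```python
from datasets import load_from_disk

# Hedged sketch: assumes hebrew_aramaic_dataset was saved with datasets.save_to_disk
# and exposes a "train" split with the fields described above.
ds = load_from_disk("../hebrew_aramaic_dataset")  # path as used by the training commands below
example = ds["train"][0]
print(example["hebrew"])    # Samaritan Hebrew source
print(example["aramaic"])   # Samaritan Targumic Aramaic target (Hebrew script)
print(example["book"], example["chapter"], example["verse"])
```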
Training Configuration
Initial Training (20 epochs)
- Batch Size: 8 (optimized for 12GB GPU)
- Learning Rate: 1e-5
- Epochs: 20
- Warmup Steps: 1000
- Gradient Accumulation: 4 (effective batch size: 32)
- FP16: Enabled for faster training
- Language Prefix: Uses `>>heb<<` and `>>arc<<` tokens (see the configuration sketch after this list)
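As a rough illustration, the hyperparameters above correspond to a `Seq2SeqTrainingArguments` configuration like the following; this is a hedged sketch, not the project's actual training script, and the output directory name is only illustrative.

```python
from transformers import Seq2SeqTrainingArguments

# Hedged sketch of the initial-training hyperparameters listed above.
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model_improved",  # illustrative; matches the checkpoint path used later
    per_device_train_batch_size=8,                 # sized for a 12GB GPU
    gradient_accumulation_steps=4,                 # effective batch size 8 x 4 = 32
    learning_rate=1e-5,
    num_train_epochs=20,
    warmup_steps=1000,
    fp16=True,                                     # mixed precision for faster training
    predict_with_generate=True,                    # generate during evaluation so BLEU can be computed
)
```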
Continued Training (30 additional epochs)
- Learning Rate: 5e-6 (lower for fine-tuning)
- Epochs: 30 additional (50 total)
- Warmup Steps: 500 (fewer than in the initial phase)
- Early Stopping: patience of 5 evaluations (see the resume sketch after this list)
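A minimal sketch of how the continued phase can be wired up with `transformers`: reload the initial checkpoint, lower the learning rate, and attach early stopping. The `train_dataset`/`eval_dataset` variables are placeholders for the tokenized splits, and the evaluation-strategy argument name differs slightly across transformers releases.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Hedged sketch: resume from the initial checkpoint with a reduced learning rate.
checkpoint = "./hebrew_aramaic_model_improved"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model_continued",
    learning_rate=5e-6,               # lower than the initial phase
    num_train_epochs=30,
    warmup_steps=500,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    fp16=True,
    eval_strategy="epoch",            # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # placeholder: tokenized training split
    eval_dataset=eval_dataset,        # placeholder: tokenized evaluation split
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```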
Performance Metrics
Final Results (50 epochs total)
- BLEU Score: 48.14
- Training Loss: 0.96
- Test Loss: 1.02
- Character Accuracy: 41.90%
- Vocabulary Size: 33,701
- Model Parameters: 61,917,696
Training Progress
| Metric | Initial (20 epochs) | Final (50 epochs) | Improvement |
|---|---|---|---|
| Training Loss | 4.36 | 0.96 | 78% reduction |
| BLEU Score | 40.32 | 48.14 | +7.82 points |
| Char Accuracy | 36.95% | 41.90% | +4.95% |
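A BLEU score of this kind can be computed with the `evaluate` library listed under Requirements; the sketch below assumes the sacrebleu metric and that decoded predictions and references are already available as plain strings.

```python
import evaluate

# Hedged sketch of a BLEU computation on decoded test outputs.
bleu = evaluate.load("sacrebleu")
predictions = ["..."]   # decoded model outputs, one string per test example
references = [["..."]]  # one list of reference translations per prediction
result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.2f}")
```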
Usage
Training
Initial Training:
```bash
cd he2arc
./run_he2arc_improved.sh
```
Continued Training (recommended):
```bash
cd he2arc
./run_continued_training.sh
```
Or manually:
```bash
python continue_training.py \
  --dataset_path ../hebrew_aramaic_dataset \
  --checkpoint_path ./hebrew_aramaic_model_improved \
  --output_dir ./hebrew_aramaic_model_continued \
  --learning_rate 5e-6 \
  --num_epochs 30 \
  --use_fp16
```
Inference
Single text translation:
```bash
python inference.py \
  --model_path ./hebrew_aramaic_model_continued \
  --text "שלום עולם" \
  --direction he2arc
```
Using Hugging Face Hub:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/marianmt-he2arc-sam"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate Hebrew to Aramaic
text = "שלום עולם"
inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Translation: {translation}")
```
Batch translation from file:
```bash
python inference.py \
  --model_path ./hebrew_aramaic_model_continued \
  --input_file input_texts.txt \
  --output_file translations.txt \
  --direction he2arc
```
Interactive mode:
```bash
python inference.py --model_path ./hebrew_aramaic_model_continued
```
Model Information
The final trained model is saved in `./hebrew_aramaic_model_continued/` with the following files (a small inspection sketch follows the list):
- `model.safetensors`: Model weights
- `tokenizer_config.json`: Tokenizer configuration
- `model_info.json`: Training information and metadata
- `training_args.bin`: Training arguments
- `test_results.json`: Final evaluation results
- `all_results.json`: Complete training history
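The JSON metadata files can be inspected with a few lines of standard-library code; this is an optional, hedged sketch that assumes nothing about their exact schema.

```python
import json
from pathlib import Path

# Hedged sketch: print whatever metadata and test metrics were saved with the model.
model_dir = Path("./hebrew_aramaic_model_continued")
for name in ("model_info.json", "test_results.json", "all_results.json"):
    path = model_dir / name
    if path.exists():
        print(name, json.loads(path.read_text(encoding="utf-8")))
```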
Language Tokens
The model uses the following language tokens:
- `>>heb<<`: Hebrew (source language)
- `>>arc<<`: Aramaic (target language)
These tokens are added to the tokenizer during training and used during inference to specify the translation direction.
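A minimal sketch of that mechanism, assuming the standard `transformers` tokenizer API: check whether the tokens are in the vocabulary, add any that are missing, and resize the model embeddings so they can be learned during fine-tuning. This is illustrative rather than the project's exact code.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hedged sketch: make sure the language tokens exist before fine-tuning.
base = "Helsinki-NLP/opus-mt-sem-sem"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

missing = [t for t in (">>heb<<", ">>arc<<") if t not in tokenizer.get_vocab()]
if missing:
    tokenizer.add_tokens(missing)
    model.resize_token_embeddings(len(tokenizer))  # grow embeddings for the new tokens
```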
Performance
Final Performance Metrics
- BLEU Score: 48.14 (a solid score for Hebrew→Aramaic translation)
- Character Accuracy: 41.90% (script-level accuracy)
- Training Loss: 0.96 (the model fits the training data well)
- Test Loss: 1.02 (close to the training loss, suggesting reasonable generalization)
Training Efficiency
- Total Training Time: ~32 minutes (20 + 12 minutes)
- Samples per Second: 114.5
- Steps per Second: 3.58
- Effective Batch Size: 32 (8 × 4 gradient accumulation)
Requirements
- Python 3.8+
- PyTorch
- Transformers
- Datasets
- Evaluate (for BLEU calculation)
- CUDA-compatible GPU (recommended)
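These can be installed with pip; the line below is a hedged one-liner assuming the standard PyPI package names (sacrebleu is included because the `evaluate` BLEU metric typically depends on it).

```bash
pip install torch transformers datasets evaluate sacrebleu
```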
Notes
- The model is specifically optimized for Hebrew to Aramaic translation
- Aramaic text is expected to be in Hebrew script
- The base model (sem-sem) supports multiple Semitic languages
- Training logs are saved to `he2arc_training.log` and `he2arc_continued_training.log`
- The model shows significant improvement with continued training (50 epochs total)
- Best performance achieved with learning rate reduction in continued training phase
Troubleshooting
- Out of Memory: Reduce batch size or gradient accumulation steps
- Poor Performance:
  - Check dataset quality
  - Consider continued training with a lower learning rate
  - Ensure proper language token usage (`>>heb<<` and `>>arc<<`)
- Language Token Issues: Ensure the `>>arc<<` token is properly added to the tokenizer (a quick check snippet follows this list)
- Training Loss Not Decreasing: Try continued training with a reduced learning rate (5e-6)
- Mixed Language Output: This indicates under-training; continue training for more epochs
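For the language-token issues above, a quick check is to load the published tokenizer and confirm both tokens are present; a hedged snippet:

```python
from transformers import AutoTokenizer

# Hedged check: both language tokens should appear in the tokenizer vocabulary.
tok = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-he2arc-sam")
vocab = tok.get_vocab()
print(">>heb<<" in vocab, ">>arc<<" in vocab)
```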
Citation
If you use this model, please cite:
@misc{marianmt-he2arc-sam,
author = {John Locke Jr.},
title = {Samaritan Hebrew to Samaritan Targumic Aramaic Translation Model},
year = {2025},
publisher = {Hugging Face},
journal = {Hugging Face model repository},
howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2arc-sam}},
}
Base Model: Helsinki-NLP/opus-mt-sem-sem
Dataset: Custom biblical Hebrew-Aramaic parallel corpus