Samaritan Hebrew to Samaritan Targumic Aramaic Translation Model

This model fine-tunes Helsinki-NLP/opus-mt-sem-sem for Samaritan Hebrew to Samaritan Targumic Aramaic translation, with the Aramaic written in Hebrew script. It is trained specifically on the biblical text of the Samaritan Torah as extant in the BL Or. 7562 manuscript.

Model Details

  • Model Name: johnlockejrr/marianmt-he2arc-sam
  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Language Pair: Samaritan Hebrew → Samaritan Targumic Aramaic (he2arc)
  • Script: Aramaic written in Hebrew characters
  • Domain: Biblical and religious texts
  • License: MIT

Overview

The model is based on Helsinki-NLP/opus-mt-sem-sem, which is designed for translation among Semitic languages. We fine-tune it for Samaritan Hebrew to Samaritan Targumic Aramaic translation using a small custom dataset of biblical texts.

Key Features

  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Direction: Samaritan Hebrew → Samaritan Targumic Aramaic (he2arc)
  • Script: Aramaic written in Hebrew characters
  • Language Tokens: Uses >>heb<< for Hebrew and >>arc<< for Aramaic
  • Optimized: Configured for 12GB GPU with appropriate batch sizes and learning rates

Dataset

The training uses the hebrew_aramaic_dataset directory, whose records contain the following fields:

  • hebrew: Hebrew source text
  • aramaic: Aramaic target text (in Hebrew script)
  • Additional metadata: book, chapter, verse
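For orientation, here is a minimal sketch of inspecting the dataset with the Hugging Face datasets library. It assumes the directory was written with save_to_disk, which may not match the actual on-disk format:

    from datasets import DatasetDict, load_from_disk

    # Illustrative only: assumes ../hebrew_aramaic_dataset was saved with
    # Dataset.save_to_disk(); adapt the loading call if the files are raw JSON/CSV.
    ds = load_from_disk("../hebrew_aramaic_dataset")
    train = ds["train"] if isinstance(ds, DatasetDict) else ds

    row = train[0]
    print(row["hebrew"])                              # Hebrew source verse
    print(row["aramaic"])                             # Aramaic target verse (Hebrew script)
    print(row["book"], row["chapter"], row["verse"])  # verse-level metadata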

Training Configuration

Initial Training (20 epochs)

  • Batch Size: 8 (optimized for 12GB GPU)
  • Learning Rate: 1e-5
  • Epochs: 20
  • Warmup Steps: 1000
  • Gradient Accumulation: 4 (effective batch size: 32)
  • FP16: Enabled for faster training
  • Language Prefix: Uses >>heb<< and >>arc<< tokens
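As a rough guide, the hyperparameters above map onto standard transformers Seq2SeqTrainingArguments as sketched below. This is not the actual content of the training script; any setting not listed above is an assumption:

    from transformers import Seq2SeqTrainingArguments

    # Sketch of the initial-phase hyperparameters listed above; the real
    # script behind run_he2arc_improved.sh may differ in other settings.
    training_args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_improved",
        per_device_train_batch_size=8,    # sized for a 12GB GPU
        gradient_accumulation_steps=4,    # effective batch size 8 x 4 = 32
        learning_rate=1e-5,
        num_train_epochs=20,
        warmup_steps=1000,
        fp16=True,                        # mixed precision
        predict_with_generate=True,       # generate during eval for BLEU
    )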

Continued Training (30 additional epochs)

  • Learning Rate: 5e-6 (lower for fine-tuning)
  • Epochs: 30 additional (50 total)
  • Warmup Steps: 500 (fewer than in initial training)
  • Early Stopping: patience of 5 evaluations
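A sketch of the continued-training setup under the same assumptions: the first-phase checkpoint is reloaded and trained further with the lower learning rate and an early-stopping callback. The dataset and data-collator wiring is omitted, and this is not the code of continue_training.py itself:

    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        EarlyStoppingCallback,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    # Resume from the initial-phase checkpoint with a reduced learning rate.
    checkpoint = "./hebrew_aramaic_model_improved"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    args = Seq2SeqTrainingArguments(
        output_dir="./hebrew_aramaic_model_continued",
        learning_rate=5e-6,
        num_train_epochs=30,
        warmup_steps=500,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        fp16=True,
        eval_strategy="epoch",            # "evaluation_strategy" on older transformers
        save_strategy="epoch",
        load_best_model_at_end=True,      # required by EarlyStoppingCallback
        metric_for_best_model="eval_loss",
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        # train_dataset=..., eval_dataset=..., data_collator=... (omitted here)
        callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
    )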

Performance Metrics

Final Results (50 epochs total)

  • BLEU Score: 48.14
  • Training Loss: 0.96
  • Test Loss: 1.02
  • Character Accuracy: 41.90%
  • Vocabulary Size: 33,701
  • Model Parameters: 61,917,696
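The BLEU score is presumably computed with the evaluate library (see Requirements); "character accuracy" is not a standard metric name, so the definition below is only a plausible reconstruction, not necessarily the one used in training:

    import evaluate

    # BLEU via sacrebleu, as the Evaluate requirement suggests; the character
    # accuracy shown here (position-wise matches over the reference length)
    # is an assumed definition.
    bleu = evaluate.load("sacrebleu")

    def char_accuracy(prediction: str, reference: str) -> float:
        matches = sum(p == r for p, r in zip(prediction, reference))
        return matches / max(len(reference), 1)

    predictions = ["..."]        # model outputs (Aramaic in Hebrew script)
    references = [["..."]]       # sacrebleu expects a list of reference lists
    print(bleu.compute(predictions=predictions, references=references)["score"])
    print(char_accuracy(predictions[0], references[0][0]))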

Training Progress

Metric           Initial (20 epochs)   Final (50 epochs)   Improvement
Training Loss    4.36                  0.96                78% reduction
BLEU Score       40.32                 48.14               +7.82 points
Char Accuracy    36.95%                41.90%              +4.95 points

Usage

Training

  1. Initial Training:

    cd he2arc
    ./run_he2arc_improved.sh
    
  2. Continued Training (recommended):

    cd he2arc
    ./run_continued_training.sh
    

    Or manually:

    python continue_training.py \
        --dataset_path ../hebrew_aramaic_dataset \
        --checkpoint_path ./hebrew_aramaic_model_improved \
        --output_dir ./hebrew_aramaic_model_continued \
        --learning_rate 5e-6 \
        --num_epochs 30 \
        --use_fp16
    

Inference

  1. Single text translation:

    python inference.py \
        --model_path ./hebrew_aramaic_model_continued \
        --text "שלום עולם" \
        --direction he2arc
    
  2. Using Hugging Face Hub (a batched variant is sketched after this list):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_name = "johnlockejrr/marianmt-he2arc-sam"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
    # Translate Hebrew to Aramaic
    text = "שלום עולם"
    inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Translation: {translation}")
    
  3. Batch translation from file:

    python inference.py \
        --model_path ./hebrew_aramaic_model_continued \
        --input_file input_texts.txt \
        --output_file translations.txt \
        --direction he2arc
    
  4. Interactive mode:

    python inference.py --model_path ./hebrew_aramaic_model_continued
    

Model Information

The final trained model is saved in ./hebrew_aramaic_model_continued/ with:

  • model.safetensors: Model weights
  • tokenizer_config.json: Tokenizer configuration
  • model_info.json: Training information and metadata
  • training_args.bin: Training arguments
  • test_results.json: Final evaluation results
  • all_results.json: Complete training history
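The exact contents of the JSON files are not documented here, but they can be inspected generically, for example:

    import json
    from pathlib import Path

    model_dir = Path("./hebrew_aramaic_model_continued")
    # Pretty-print whatever metadata and evaluation results were saved;
    # the key names inside these files are not specified above.
    for name in ("model_info.json", "test_results.json", "all_results.json"):
        path = model_dir / name
        if path.exists():
            print(name)
            print(json.dumps(json.loads(path.read_text()), indent=2, ensure_ascii=False))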

Language Tokens

The model uses the following language tokens:

  • >>heb<<: Hebrew (source language)
  • >>arc<<: Aramaic (target language)

These tokens are added to the tokenizer during training and used during inference to specify the translation direction.
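If one of these tokens were missing from the base vocabulary, the usual transformers pattern is to add it and resize the embedding matrix. This is a generic sketch, not the repository's training code:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")
    model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-sem-sem")

    # Add any language token the tokenizer does not already know, then resize
    # the embedding table so the new ids get trainable embeddings.
    missing = [t for t in (">>heb<<", ">>arc<<")
               if tokenizer.convert_tokens_to_ids(t) == tokenizer.unk_token_id]
    if missing:
        tokenizer.add_tokens(missing)
        model.resize_token_embeddings(len(tokenizer))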

Performance

Final Performance Metrics

  • BLEU Score: 48.14 (good for Hebrew→Aramaic translation)
  • Character Accuracy: 41.90% (character-level accuracy of the generated Hebrew-script text)
  • Training Loss: 0.96 (final training loss)
  • Test Loss: 1.02 (close to the training loss, indicating good generalization)

Training Efficiency

  • Total Training Time: ~32 minutes (20 minutes initial + 12 minutes continued)
  • Samples per Second: 114.5
  • Steps per Second: 3.58
  • Effective Batch Size: 32 (8 × 4 gradient accumulation)

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers
  • Datasets
  • Evaluate (for BLEU calculation)
  • CUDA-compatible GPU (recommended)

Notes

  • The model is specifically optimized for Hebrew to Aramaic translation
  • Aramaic text is expected to be in Hebrew script
  • The base model (sem-sem) supports multiple Semitic languages
  • Training logs are saved to he2arc_training.log and he2arc_continued_training.log
  • The model shows significant improvement with continued training (50 epochs total)
  • Best performance achieved with learning rate reduction in continued training phase

Troubleshooting

  1. Out of Memory: Reduce batch size or gradient accumulation steps
  2. Poor Performance:
    • Check dataset quality
    • Consider continued training with lower learning rate
    • Ensure proper language token usage (>>heb<< and >>arc<<)
  3. Language Token Issues: Ensure >>arc<< token is properly added to tokenizer
  4. Training Loss Not Decreasing: Try continued training with reduced learning rate (5e-6)
  5. Mixed Language Output: This indicates under-training; continue training for more epochs

Citation

If you use this model, please cite:

@misc{marianmt-he2arc-sam,
  author = {John Locke Jr.},
  title = {Samaritan Hebrew to Samaritan Targumic Aramaic Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2arc-sam}},
}

Base Model: Helsinki-NLP/opus-mt-sem-sem
Dataset: Custom biblical Hebrew-Aramaic parallel corpus
