---
license: mit
language:
- he
- arc
base_model:
- Helsinki-NLP/opus-mt-mul-en
---
# Hebrew-Aramaic Translation Model
This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically for translating between Samaritan Hebrew and Targum Aramaic.
## Overview
The pipeline consists of:
1. **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets
2. **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model
3. **Inference** (`inference.py`) - Provides translation functionality using the trained model
## Data Format
The input data should be in TSV format with the following columns:
- `Book` - Book identifier
- `Chapter` - Chapter number
- `Verse` - Verse number
- `Targum` - Aramaic text
- `Samaritan` - Hebrew text
Example (`|` is shown in place of tab characters for readability):
```
Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
```
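As a quick sanity check before running the pipeline, the corpus can be loaded with pandas. A minimal sketch, assuming the column names above and a tab-separated file:

```python
import pandas as pd

# Load the aligned corpus; the file is tab-separated despite the
# pipe-separated rendering in the example above.
df = pd.read_csv("aligned_corpus.tsv", sep="\t")

# Keep only rows where both sides of the verse pair are present.
df = df.dropna(subset=["Targum", "Samaritan"])

print(df.head())
print(f"{len(df)} aligned verse pairs")
```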
## Installation
1. Clone this repository:
```bash
git clone <repository-url>
cd sam-aram
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. (Optional) For GPU acceleration, install a CUDA-enabled build of PyTorch.
## Usage
### Step 1: Prepare the Dataset
First, prepare your aligned corpus for training:
```bash
python prepare_dataset.py \
--input_file aligned_corpus.tsv \
--output_dir ./hebrew_aramaic_dataset \
--test_size 0.1 \
--val_size 0.1
```
This will:
- Load the TSV file
- Clean and filter the data
- Split into train/validation/test sets
- Save the processed dataset
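The split logic likely resembles the following sketch built on the `datasets` library; the actual `prepare_dataset.py` may differ in details such as shuffling and cleaning:

```python
import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.read_csv("aligned_corpus.tsv", sep="\t").dropna(subset=["Targum", "Samaritan"])
ds = Dataset.from_pandas(df)

# Carve off the test set first, then take validation from the remainder
# (mirrors --test_size 0.1 and --val_size 0.1).
split = ds.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1, seed=42)

DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
}).save_to_disk("./hebrew_aramaic_dataset")
```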
### Step 2: Train the Model
Train a translation model using the prepared dataset:
```bash
python train_translation_model.py \
--dataset_path ./hebrew_aramaic_dataset \
--output_dir ./hebrew_aramaic_model \
--model_name Helsinki-NLP/opus-mt-mul-en \
--direction he2arc \
--batch_size 8 \
--learning_rate 2e-5 \
--num_epochs 3 \
--use_fp16
```
#### Key Parameters:
- `--model_name`: Pre-trained model to fine-tune. Recommended options:
- `Helsinki-NLP/opus-mt-mul-en` (multilingual)
- `Helsinki-NLP/opus-mt-he-en` (Hebrew-English)
- `Helsinki-NLP/opus-mt-ar-en` (Arabic-English)
- `--direction`: Translation direction (`he2arc` or `arc2he`)
- `--batch_size`: Training batch size (adjust based on GPU memory)
- `--learning_rate`: Learning rate for fine-tuning
- `--num_epochs`: Number of training epochs
- `--use_fp16`: Enable mixed precision training (faster, less memory)
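For orientation, here is a minimal sketch of how these flags might map onto a `Seq2SeqTrainer` setup. The actual `train_translation_model.py` may structure things differently (direction handling, metrics, max lengths are illustrative here):

```python
from datasets import load_from_disk
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_from_disk("./hebrew_aramaic_dataset")

def preprocess(batch):
    # he2arc: Samaritan Hebrew is the source, Targum Aramaic the target.
    inputs = tokenizer(batch["Samaritan"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["Targum"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,   # --batch_size
    learning_rate=2e-5,              # --learning_rate
    num_train_epochs=3,              # --num_epochs
    fp16=True,                       # --use_fp16
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./hebrew_aramaic_model")
tokenizer.save_pretrained("./hebrew_aramaic_model")
```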
#### Training with Weights & Biases (Optional):
```bash
python train_translation_model.py \
--dataset_path ./hebrew_aramaic_dataset \
--output_dir ./hebrew_aramaic_model \
--model_name Helsinki-NLP/opus-mt-mul-en \
--use_wandb
```
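Under the hood, `--use_wandb` presumably toggles the Trainer's `report_to` setting; a sketch of the relevant arguments (the run name is illustrative):

```python
from transformers import Seq2SeqTrainingArguments

# Route Trainer logs to Weights & Biases instead of the default loggers.
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    report_to="wandb",
    run_name="he2arc-finetune",  # hypothetical run name
)
```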
### Step 3: Use the Trained Model
#### Interactive Translation:
```bash
python inference.py --model_path ./hebrew_aramaic_model
```
#### Translate a Single Text:
```bash
python inference.py \
--model_path ./hebrew_aramaic_model \
--text "מפרי עץ הגן נאכל" \
--direction he2arc
```
#### Batch Translation:
```bash
python inference.py \
--model_path ./hebrew_aramaic_model \
--input_file input_texts.txt \
--output_file translations.txt \
--direction he2arc
```
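Programmatic use follows the standard `transformers` generation API. A sketch, assuming the fine-tuned model was saved to `./hebrew_aramaic_model`:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_dir = "./hebrew_aramaic_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

def translate(text: str, max_length: int = 128) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    output_ids = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(translate("מפרי עץ הגן נאכל"))  # Hebrew -> Aramaic (he2arc)
```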
## Model Recommendations
Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation:
### 1. Multilingual Models
- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning
- `facebook/m2m100_1.2B` - Large multilingual model; it covers Hebrew, though Aramaic is not among its listed languages
### 2. Hebrew-Related Models
- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted)
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (Semitic language family)
### 3. Arabic-Related Models
- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic)
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew
## Training Tips
### 1. Data Quality
- Ensure your parallel texts are properly aligned
- Clean the data to remove noise and inconsistencies
- Consider the length ratio between source and target texts (see the sketch after this list)
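One simple heuristic for the length-ratio check, as a self-contained sketch (the threshold of 2.0 is an illustrative choice, not a value from this project):

```python
def plausible_pair(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs whose character lengths differ by more than max_ratio."""
    longer = max(len(src), len(tgt))
    shorter = max(min(len(src), len(tgt)), 1)
    return longer / shorter <= max_ratio

pairs = [
    ("מפרי עץ הגן נאכל", "מן אילן גנה ניכל"),  # kept: similar lengths
    ("מפרי עץ הגן נאכל", "מן"),                # dropped: ratio too high
]
filtered = [p for p in pairs if plausible_pair(*p)]
```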
### 2. Model Selection
- Start with a multilingual model if available
- Consider the vocabulary overlap between your languages
- Test different pre-trained models to find the best starting point
### 3. Hyperparameter Tuning
- Use smaller batch sizes for limited GPU memory
- Start with a lower learning rate (1e-5 to 5e-5)
- Increase epochs if the model hasn't converged
- Use early stopping to prevent overfitting (see the sketch after this list)
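With `transformers`, early stopping can be attached to the `trainer` from the training sketch above. Note that it requires per-epoch (or per-step) evaluation plus `load_best_model_at_end=True` and a `metric_for_best_model` in the training arguments:

```python
from transformers import EarlyStoppingCallback

# Stop training when the monitored metric fails to improve for 3 evaluations.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
trainer.train()
```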
### 4. Evaluation
- Monitor BLEU score during training
- Use character-level accuracy for Hebrew/Aramaic
- Test on a held-out test set
## File Structure
```
sam-aram/
├── aligned_corpus.tsv # Input parallel corpus
├── prepare_dataset.py # Dataset preparation script
├── train_translation_model.py # Training script
├── inference.py # Inference script
├── requirements.txt # Python dependencies
├── README.md # This file
├── info.txt # Project information
├── hebrew_aramaic_dataset/ # Prepared dataset (created)
└── hebrew_aramaic_model/ # Trained model (created)
```
## Troubleshooting
### Common Issues:
1. **Out of Memory**: Reduce batch size or use gradient accumulation
2. **Poor Translation Quality**:
- Check data quality and alignment
- Try different pre-trained models
- Increase training epochs
- Adjust learning rate
3. **Tokenization Issues**:
- Ensure the tokenizer supports Hebrew/Aramaic scripts
- Check for proper UTF-8 encoding
4. **Training Instability**:
- Reduce learning rate
- Increase warmup steps
- Use gradient clipping
### Performance Optimization:
- Use mixed precision training (`--use_fp16`)
- Enable gradient accumulation for larger effective batch sizes (see the sketch after this list)
- Use multiple GPUs if available
- Consider model quantization for inference
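Gradient accumulation and mixed precision are both plain `Seq2SeqTrainingArguments` settings; a sketch:

```python
from transformers import Seq2SeqTrainingArguments

# Effective batch size = 8 * 4 = 32, while only 8 examples occupy GPU
# memory per step; fp16 roughly halves activation memory on supported GPUs.
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    fp16=True,  # corresponds to --use_fp16
)
```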
## Evaluation Metrics
The training script computes:
- **BLEU Score**: Standard machine translation metric
- **Character Accuracy**: Character-level accuracy for Hebrew/Aramaic text
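A sketch of how these two metrics can be computed with `sacrebleu` and plain Python; the training script's exact character-accuracy definition may differ from this position-matching version:

```python
import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    """Fraction of matching characters at aligned positions,
    normalized by the longer string (one plausible definition)."""
    if not pred and not ref:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

preds = ["מן אילן גנה ניכל"]
refs = [["מן אילן גנה ניכל"]]  # list of reference streams, per sacrebleu's API
print(f"BLEU: {sacrebleu.corpus_bleu(preds, refs).score:.2f}")
print(f"Char accuracy: {char_accuracy(preds[0], refs[0][0]):.2%}")
```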
## Contributing
To improve the pipeline:
1. Test with different pre-trained models
2. Experiment with different data preprocessing techniques
3. Add more evaluation metrics
4. Optimize for specific use cases
## License
This project is released under the MIT License and is provided as-is for research and educational purposes.
## References
- [MarianMT Documentation](https://huggingface.co/docs/transformers/en/model_doc/marian)
- [Helsinki-NLP Models](https://huggingface.co/Helsinki-NLP)
- [Transformers Library](https://huggingface.co/docs/transformers/)