|
---
license: mit
language:
- he
- arc
base_model:
- Helsinki-NLP/opus-mt-mul-en
---
|
# Hebrew-Aramaic Translation Model |
|
|
|
This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically for translation between Samaritan Hebrew and Targum Aramaic.
|
|
|
## Overview |
|
|
|
The pipeline consists of: |
|
1. **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets |
|
2. **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model |
|
3. **Inference** (`inference.py`) - Provides translation functionality using the trained model |
|
|
|
## Data Format |
|
|
|
The input data should be in TSV format with the following columns: |
|
- `Book` - Book identifier |
|
- `Chapter` - Chapter number |
|
- `Verse` - Verse number |
|
- `Targum` - Aramaic text |
|
- `Samaritan` - Hebrew text |
|
|
|
Example (the file is tab-separated; `|` is shown below only as a visual column separator):

```
Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
```
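
For a quick sanity check of the corpus, here is a minimal sketch of loading it with pandas. It assumes a tab-separated file with the columns listed above and is not part of the pipeline scripts:

```python
import pandas as pd

# Read the aligned corpus; assumed to be tab-separated with the columns above.
df = pd.read_csv("aligned_corpus.tsv", sep="\t")

# Drop rows where either side of the verse pair is missing.
df = df.dropna(subset=["Targum", "Samaritan"])

print(df[["Book", "Chapter", "Verse"]].head())
print(f"{len(df)} aligned verse pairs")
```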
|
|
|
## Installation |
|
|
|
1. Clone this repository: |
|
```bash
git clone <repository-url>
cd sam-aram
```
|
|
|
2. Install dependencies: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
3. (Optional) Install CUDA and a GPU-enabled PyTorch build if you want GPU acceleration.
|
|
|
## Usage |
|
|
|
### Step 1: Prepare the Dataset |
|
|
|
First, prepare your aligned corpus for training: |
|
|
|
```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```
|
|
|
This will: |
|
- Load the TSV file |
|
- Clean and filter the data |
|
- Split into train/validation/test sets |
|
- Save the processed dataset (a rough sketch of these steps follows this list)
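
As a rough illustration (not necessarily the exact logic of `prepare_dataset.py`), an equivalent split-and-save with the Hugging Face `datasets` library could look like this:

```python
import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.read_csv("aligned_corpus.tsv", sep="\t").dropna(subset=["Targum", "Samaritan"])
dataset = Dataset.from_pandas(df, preserve_index=False)

# 80/10/10 split: carve out 10% for test, then 10% of the total from the remainder for validation.
split = dataset.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1 / 0.9, seed=42)

DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
}).save_to_disk("./hebrew_aramaic_dataset")
```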
|
|
|
### Step 2: Train the Model |
|
|
|
Train a translation model using the prepared dataset: |
|
|
|
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```
|
|
|
#### Key Parameters: |
|
|
|
- `--model_name`: Pre-trained model to fine-tune. Recommended options: |
|
- `Helsinki-NLP/opus-mt-mul-en` (multilingual) |
|
- `Helsinki-NLP/opus-mt-he-en` (Hebrew-English) |
|
- `Helsinki-NLP/opus-mt-ar-en` (Arabic-English) |
|
- `--direction`: Translation direction (`he2arc` or `arc2he`) |
|
- `--batch_size`: Training batch size (adjust based on GPU memory) |
|
- `--learning_rate`: Learning rate for fine-tuning |
|
- `--num_epochs`: Number of training epochs |
|
- `--use_fp16`: Enable mixed precision training (faster, less memory) |
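
For orientation, the command and parameters above correspond roughly to the following `transformers` Seq2Seq fine-tuning loop. This is a condensed sketch, not the actual contents of `train_translation_model.py`; the column names `Samaritan`/`Targum` and the preprocessing details are assumptions based on the data format described earlier:

```python
from datasets import load_from_disk
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_from_disk("./hebrew_aramaic_dataset")

def preprocess(batch):
    # he2arc: Samaritan Hebrew is the source, Targum Aramaic is the target.
    inputs = tokenizer(batch["Samaritan"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["Targum"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,                      # --use_fp16
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./hebrew_aramaic_model")
tokenizer.save_pretrained("./hebrew_aramaic_model")
```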
|
|
|
#### Training with Weights & Biases (Optional): |
|
|
|
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```
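
Internally, a `--use_wandb` flag like this one typically just switches on the built-in `transformers` integration; a hedged sketch of the relevant arguments (the actual script may configure W&B differently):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of --use_wandb onto the built-in W&B integration
# (requires `pip install wandb` and `wandb login` beforehand).
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    report_to=["wandb"],
    run_name="he2arc-opus-mt-mul-en",
)
```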
|
|
|
### Step 3: Use the Trained Model |
|
|
|
#### Interactive Translation: |
|
|
|
```bash |
|
python inference.py --model_path ./hebrew_aramaic_model |
|
``` |
|
|
|
#### Translate a Single Text: |
|
|
|
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```
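
If you prefer to call the fine-tuned model directly from Python rather than through `inference.py`, a minimal sketch (the generation settings are illustrative, not those used by the script):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "./hebrew_aramaic_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

text = "מפרי עץ הגן נאכל"  # Samaritan Hebrew source (he2arc)
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```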
|
|
|
#### Batch Translation: |
|
|
|
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```
|
|
|
## Model Recommendations |
|
|
|
Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation: |
|
|
|
### 1. Multilingual Models |
|
- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning |
|
- `facebook/m2m100_1.2B` - Large multilingual model that covers Hebrew (Aramaic itself is not among its listed languages)
|
|
|
### 2. Hebrew-Related Models |
|
- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted) |
|
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (Semitic language family) |
|
|
|
### 3. Arabic-Related Models |
|
- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic) |
|
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew |
|
|
|
## Training Tips |
|
|
|
### 1. Data Quality |
|
- Ensure your parallel texts are properly aligned |
|
- Clean the data to remove noise and inconsistencies |
|
- Consider the length ratio between source and target texts |
|
|
|
### 2. Model Selection |
|
- Start with a multilingual model if available |
|
- Consider the vocabulary overlap between your languages |
|
- Test different pre-trained models to find the best starting point |
|
|
|
### 3. Hyperparameter Tuning |
|
- Use smaller batch sizes for limited GPU memory |
|
- Start with a lower learning rate (1e-5 to 5e-5) |
|
- Increase epochs if the model hasn't converged |
|
- Use early stopping to prevent overfitting (see the sketch below)
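
Early stopping, in particular, can be wired in through the standard `transformers` callback; a sketch under the Seq2Seq setup assumed above:

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# Then pass the callback to the trainer:
# Seq2SeqTrainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```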
|
|
|
### 4. Evaluation |
|
- Monitor BLEU score during training |
|
- Use character-level accuracy for Hebrew/Aramaic |
|
- Test on a held-out test set |
|
|
|
## File Structure |
|
|
|
```
sam-aram/
├── aligned_corpus.tsv            # Input parallel corpus
├── prepare_dataset.py            # Dataset preparation script
├── train_translation_model.py    # Training script
├── inference.py                  # Inference script
├── requirements.txt              # Python dependencies
├── README.md                     # This file
├── info.txt                      # Project information
├── hebrew_aramaic_dataset/       # Prepared dataset (created)
└── hebrew_aramaic_model/         # Trained model (created)
```
|
|
|
## Troubleshooting |
|
|
|
### Common Issues: |
|
|
|
1. **Out of Memory**: Reduce the batch size or use gradient accumulation (the sketch after this list shows the relevant settings)
|
2. **Poor Translation Quality**: |
|
- Check data quality and alignment |
|
- Try different pre-trained models |
|
- Increase training epochs |
|
- Adjust learning rate |
|
|
|
3. **Tokenization Issues**: |
|
- Ensure the tokenizer supports Hebrew/Aramaic scripts |
|
- Check for proper UTF-8 encoding |
|
|
|
4. **Training Instability**: |
|
- Reduce learning rate |
|
- Increase warmup steps |
|
- Use gradient clipping |
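
The memory and stability fixes above map onto standard `Seq2SeqTrainingArguments` options; a sketch with illustrative values (the script's own flags may expose only a subset of these):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=4,    # smaller batches for limited GPU memory
    gradient_accumulation_steps=4,    # effective batch size of 16
    learning_rate=1e-5,               # lower learning rate for stability
    warmup_steps=500,                 # longer warmup
    max_grad_norm=1.0,                # gradient clipping
)
```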
|
|
|
### Performance Optimization: |
|
|
|
- Use mixed precision training (`--use_fp16`) |
|
- Enable gradient accumulation for larger effective batch sizes |
|
- Use multiple GPUs if available |
|
- Consider model quantization for inference (see the sketch below)
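
As one example of the last point, PyTorch dynamic quantization can shrink the trained model's linear layers for CPU inference; a sketch (translation quality should be re-checked on the test set afterwards):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "./hebrew_aramaic_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# Replace Linear layers with int8-weight versions for CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("מפרי עץ הגן נאכל", return_tensors="pt")
output_ids = quantized.generate(**inputs, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```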
|
|
|
## Evaluation Metrics |
|
|
|
The training script computes the following metrics (a computation sketch follows the list):
|
- **BLEU Score**: Standard machine translation metric |
|
- **Character Accuracy**: Character-level accuracy for Hebrew/Aramaic text |
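
A sketch of how these two metrics can be computed on decoded predictions with `sacrebleu` (the exact character-accuracy definition used by the training script may differ):

```python
import sacrebleu

def compute_metrics(predictions, references):
    """predictions/references: lists of decoded strings, one reference per prediction."""
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score

    # Simple character-level accuracy: fraction of matching positions,
    # averaged over sentence pairs (one possible definition).
    char_accs = []
    for pred, ref in zip(predictions, references):
        if not ref:
            continue
        matches = sum(p == r for p, r in zip(pred, ref))
        char_accs.append(matches / max(len(pred), len(ref)))

    return {"bleu": bleu, "char_accuracy": sum(char_accs) / max(len(char_accs), 1)}

print(compute_metrics(["מן אילן גנה ניכל"], ["מן אילן גנה ניכל"]))
```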
|
|
|
## Contributing |
|
|
|
To improve the pipeline: |
|
1. Test with different pre-trained models |
|
2. Experiment with different data preprocessing techniques |
|
3. Add more evaluation metrics |
|
4. Optimize for specific use cases |
|
|
|
## License |
|
|
|
This project is released under the MIT License (see the metadata above) and is provided as-is for research and educational purposes.
|
|
|
## References |
|
|
|
- [MarianMT Documentation](https://huggingface.co/docs/transformers/en/model_doc/marian) |
|
- [Helsinki-NLP Models](https://huggingface.co/Helsinki-NLP) |
|
- [Transformers Library](https://huggingface.co/docs/transformers/) |