|
---
license: mit
language:
- he
- arc
base_model:
- Helsinki-NLP/opus-mt-mul-en
---
|
# Hebrew-Aramaic Translation Model |
|
|
|
This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically for translation between Samaritan Hebrew and Targum Aramaic.
|
|
|
## Overview |
|
|
|
The pipeline consists of: |
|
1. **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets |
|
2. **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model |
|
3. **Inference** (`inference.py`) - Provides translation functionality using the trained model |
|
|
|
## Data Format |
|
|
|
The input data should be in TSV format with the following columns: |
|
- `Book` - Book identifier |
|
- `Chapter` - Chapter number |
|
- `Verse` - Verse number |
|
- `Targum` - Aramaic text |
|
- `Samaritan` - Hebrew text |
|
|
|
Example (the file is tab-separated; `|` is shown below only as a visual column separator):

```
Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
```
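
For a quick sanity check of the corpus, here is a minimal sketch of loading it with pandas. It assumes a tab-separated file with the columns listed above and is not part of the pipeline scripts:

```python
import pandas as pd

# Read the aligned corpus; assumed to be tab-separated with the columns above.
df = pd.read_csv("aligned_corpus.tsv", sep="\t")

# Drop rows where either side of the verse pair is missing.
df = df.dropna(subset=["Targum", "Samaritan"])

print(df[["Book", "Chapter", "Verse"]].head())
print(f"{len(df)} aligned verse pairs")
```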
|
|
|
## Installation |
|
|
|
1. Clone this repository: |
|
```bash
git clone <repository-url>
cd sam-aram
```
|
|
|
2. Install dependencies: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
3. (Optional) Install CUDA and a GPU-enabled PyTorch build if you want GPU acceleration.
|
|
|
## Usage |
|
|
|
### Step 1: Prepare the Dataset |
|
|
|
First, prepare your aligned corpus for training: |
|
|
|
```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```
|
|
|
This will: |
|
- Load the TSV file |
|
- Clean and filter the data |
|
- Split into train/validation/test sets |
|
- Save the processed dataset (a rough sketch of these steps follows this list)
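
As a rough illustration (not necessarily the exact logic of `prepare_dataset.py`), an equivalent split-and-save with the Hugging Face `datasets` library could look like this:

```python
import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.read_csv("aligned_corpus.tsv", sep="\t").dropna(subset=["Targum", "Samaritan"])
dataset = Dataset.from_pandas(df, preserve_index=False)

# 80/10/10 split: carve out 10% for test, then 10% of the total from the remainder for validation.
split = dataset.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1 / 0.9, seed=42)

DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
}).save_to_disk("./hebrew_aramaic_dataset")
```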
|
|
|
### Step 2: Train the Model |
|
|
|
Train a translation model using the prepared dataset: |
|
|
|
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```
|
|
|
#### Key Parameters: |
|
|
|
- `--model_name`: Pre-trained model to fine-tune. Recommended options: |
|
- `Helsinki-NLP/opus-mt-mul-en` (multilingual) |
|
- `Helsinki-NLP/opus-mt-he-en` (Hebrew-English) |
|
- `Helsinki-NLP/opus-mt-ar-en` (Arabic-English) |
|
- `--direction`: Translation direction (`he2arc` or `arc2he`) |
|
- `--batch_size`: Training batch size (adjust based on GPU memory) |
|
- `--learning_rate`: Learning rate for fine-tuning |
|
- `--num_epochs`: Number of training epochs |
|
- `--use_fp16`: Enable mixed precision training (faster, less memory) |
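
For orientation, the command and parameters above correspond roughly to the following `transformers` Seq2Seq fine-tuning loop. This is a condensed sketch, not the actual contents of `train_translation_model.py`; the column names `Samaritan`/`Targum` and the preprocessing details are assumptions based on the data format described earlier:

```python
from datasets import load_from_disk
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_from_disk("./hebrew_aramaic_dataset")

def preprocess(batch):
    # he2arc: Samaritan Hebrew is the source, Targum Aramaic is the target.
    inputs = tokenizer(batch["Samaritan"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["Targum"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,                      # --use_fp16
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./hebrew_aramaic_model")
tokenizer.save_pretrained("./hebrew_aramaic_model")
```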
|
|
|
#### Training with Weights & Biases (Optional): |
|
|
|
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```
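
Internally, a `--use_wandb` flag like this one typically just switches on the built-in `transformers` integration; a hedged sketch of the relevant arguments (the actual script may configure W&B differently):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of --use_wandb onto the built-in W&B integration
# (requires `pip install wandb` and `wandb login` beforehand).
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    report_to=["wandb"],
    run_name="he2arc-opus-mt-mul-en",
)
```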
|
|
|
### Step 3: Use the Trained Model |
|
|
|
#### Interactive Translation: |
|
|
|
```bash |
|
python inference.py --model_path ./hebrew_aramaic_model |
|
``` |
|
|
|
#### Translate a Single Text: |
|
|
|
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```
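
If you prefer to call the fine-tuned model directly from Python rather than through `inference.py`, a minimal sketch (the generation settings are illustrative, not those used by the script):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "./hebrew_aramaic_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

text = "מפרי עץ הגן נאכל"  # Samaritan Hebrew source (he2arc)
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```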
|
|
|
#### Batch Translation: |
|
|
|
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```
|
|
|
## Model Recommendations |
|
|
|
Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation: |
|
|
|
### 1. Multilingual Models |
|
- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning |
|
- `facebook/m2m100_1.2B` - Large multilingual model that covers Hebrew (Aramaic itself is not among its listed languages)
|
|
|
### 2. Hebrew-Related Models |
|
- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted) |
|
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (Semitic language family) |
|
|
|
### 3. Arabic-Related Models |
|
- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic) |
|
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew |
|
|
|
## Training Tips |
|
|
|
### 1. Data Quality |
|
- Ensure your parallel texts are properly aligned |
|
- Clean the data to remove noise and inconsistencies |
|
- Consider the length ratio between source and target texts |
|
|
|
### 2. Model Selection |
|
- Start with a multilingual model if available |
|
- Consider the vocabulary overlap between your languages |
|
- Test different pre-trained models to find the best starting point |
|
|
|
### 3. Hyperparameter Tuning |
|
- Use smaller batch sizes for limited GPU memory |
|
- Start with a lower learning rate (1e-5 to 5e-5) |
|
- Increase epochs if the model hasn't converged |
|
- Use early stopping to prevent overfitting (see the sketch below)
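
Early stopping, in particular, can be wired in through the standard `transformers` callback; a sketch under the Seq2Seq setup assumed above:

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# Then pass the callback to the trainer:
# Seq2SeqTrainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```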
|
|
|
### 4. Evaluation |
|
- Monitor BLEU score during training |
|
- Use character-level accuracy for Hebrew/Aramaic |
|
- Test on a held-out test set |
|
|
|
## File Structure |
|
|
|
```
sam-aram/
├── aligned_corpus.tsv            # Input parallel corpus
├── prepare_dataset.py            # Dataset preparation script
├── train_translation_model.py    # Training script
├── inference.py                  # Inference script
├── requirements.txt              # Python dependencies
├── README.md                     # This file
├── info.txt                      # Project information
├── hebrew_aramaic_dataset/       # Prepared dataset (created)
└── hebrew_aramaic_model/         # Trained model (created)
```
|
|
|
## Troubleshooting |
|
|
|
### Common Issues: |
|
|
|
1. **Out of Memory**: Reduce the batch size or use gradient accumulation (the sketch after this list shows the relevant settings)
|
2. **Poor Translation Quality**: |
|
- Check data quality and alignment |
|
- Try different pre-trained models |
|
- Increase training epochs |
|
- Adjust learning rate |
|
|
|
3. **Tokenization Issues**: |
|
- Ensure the tokenizer supports Hebrew/Aramaic scripts |
|
- Check for proper UTF-8 encoding |
|
|
|
4. **Training Instability**: |
|
- Reduce learning rate |
|
- Increase warmup steps |
|
- Use gradient clipping |
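
The memory and stability fixes above map onto standard `Seq2SeqTrainingArguments` options; a sketch with illustrative values (the script's own flags may expose only a subset of these):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=4,    # smaller batches for limited GPU memory
    gradient_accumulation_steps=4,    # effective batch size of 16
    learning_rate=1e-5,               # lower learning rate for stability
    warmup_steps=500,                 # longer warmup
    max_grad_norm=1.0,                # gradient clipping
)
```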
|
|
|
### Performance Optimization: |
|
|
|
- Use mixed precision training (`--use_fp16`) |
|
- Enable gradient accumulation for larger effective batch sizes |
|
- Use multiple GPUs if available |
|
- Consider model quantization for inference (see the sketch below)
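
As one example of the last point, PyTorch dynamic quantization can shrink the trained model's linear layers for CPU inference; a sketch (translation quality should be re-checked on the test set afterwards):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "./hebrew_aramaic_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# Replace Linear layers with int8-weight versions for CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("מפרי עץ הגן נאכל", return_tensors="pt")
output_ids = quantized.generate(**inputs, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```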
|
|
|
## Evaluation Metrics |
|
|
|
The training script computes the following metrics (a computation sketch follows the list):
|
- **BLEU Score**: Standard machine translation metric |
|
- **Character Accuracy**: Character-level accuracy for Hebrew/Aramaic text |
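
A sketch of how these two metrics can be computed on decoded predictions with `sacrebleu` (the exact character-accuracy definition used by the training script may differ):

```python
import sacrebleu

def compute_metrics(predictions, references):
    """predictions/references: lists of decoded strings, one reference per prediction."""
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score

    # Simple character-level accuracy: fraction of matching positions,
    # averaged over sentence pairs (one possible definition).
    char_accs = []
    for pred, ref in zip(predictions, references):
        if not ref:
            continue
        matches = sum(p == r for p, r in zip(pred, ref))
        char_accs.append(matches / max(len(pred), len(ref)))

    return {"bleu": bleu, "char_accuracy": sum(char_accs) / max(len(char_accs), 1)}

print(compute_metrics(["מן אילן גנה ניכל"], ["מן אילן גנה ניכל"]))
```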
|
|
|
## Contributing |
|
|
|
To improve the pipeline: |
|
1. Test with different pre-trained models |
|
2. Experiment with different data preprocessing techniques |
|
3. Add more evaluation metrics |
|
4. Optimize for specific use cases |
|
|
|
## License |
|
|
|
This project is released under the MIT License (see the metadata above) and is provided as-is for research and educational purposes.
|
|
|
## References |
|
|
|
- [MarianMT Documentation](https://huggingface.co/docs/transformers/en/model_doc/marian) |
|
- [Helsinki-NLP Models](https://huggingface.co/Helsinki-NLP) |
|
- [Transformers Library](https://huggingface.co/docs/transformers/) |