---
language:
- arc
tags:
- diacritization
- aramaic
- vocalization
- targum
- semitic-languages
- sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---
# Aramaic Diacritization Model (MarianMT)
This is a MarianMT model fine-tuned for Aramaic diacritization (vocalization): it converts consonantal Aramaic text into fully vocalized text with nikkud (vowel points).
## Model Description
- **Model type:** MarianMT (Encoder-Decoder Transformer)
- **Language:** Aramaic (arc2arc)
- **Task:** Text diacritization/vocalization
- **Base model:** [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- **Parameters:** 61,924,352 (61.9M)
## Model Architecture
- **Architecture:** MarianMT (Marian Machine Translation)
- **Encoder layers:** 6
- **Decoder layers:** 6
- **Hidden size:** 512
- **Attention heads:** 8
- **Feed-forward dimension:** 2048
- **Vocabulary size:** 33,714
- **Max sequence length:** 512 tokens
- **Activation function:** Swish
- **Position embeddings:** Static
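These values can be verified against the published configuration; the snippet below is a minimal sketch using the standard `MarianConfig` fields, with the values listed above shown in the comments:
```python
from transformers import MarianConfig

# Load the published configuration and print the architecture fields listed above.
config = MarianConfig.from_pretrained("johnlockejrr/aramaic-diacritization-model")
print(config.encoder_layers, config.decoder_layers)    # 6 6
print(config.d_model, config.encoder_attention_heads)  # 512 8
print(config.encoder_ffn_dim, config.vocab_size)       # 2048 33714
print(config.max_position_embeddings)                  # 512
print(config.activation_function)                      # swish
print(config.static_position_embeddings)               # True (static/sinusoidal)
```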
## Training Details
### Training Configuration
- **Training data:** 12,110 examples
- **Validation data:** 1,514 examples
- **Batch size:** 8
- **Gradient accumulation steps:** 2
- **Effective batch size:** 16
- **Learning rate:** 1e-5
- **Warmup steps:** 1,000
- **Max epochs:** 100
- **Training completed at:** Epoch 36.33
- **Mixed precision:** FP16 enabled
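The actual training script (`train_arc2arc_improved_deep.py`, listed under "Training Scripts" below) is not reproduced here; the following is a minimal `Seq2SeqTrainingArguments` sketch that mirrors the hyperparameters above, with an illustrative `output_dir`:
```python
from transformers import Seq2SeqTrainingArguments

# Only the hyperparameters listed above are set; everything else keeps its default.
training_args = Seq2SeqTrainingArguments(
    output_dir="arc2arc-diacritization",  # illustrative path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,        # effective batch size 16
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=100,
    fp16=True,                            # mixed precision
)
```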
### Training Metrics
- **Final training loss:** 0.283
- **Training runtime:** 21,727 seconds (~6 hours)
- **Training samples per second:** 55.7
- **Training steps per second:** 3.48
## Evaluation Results
### Test Set Performance
- **BLEU Score:** 72.90
- **Character Accuracy:** 63.78%
- **Evaluation Loss:** 0.088
- **Evaluation Runtime:** 311.5 seconds
- **Evaluation samples per second:** 4.86
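The exact evaluation script is not included with the model; the sketch below shows one common way such metrics are computed, using `sacrebleu` for BLEU and a simple position-wise character match for character accuracy (which may differ from the definition used during training):
```python
import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    """Fraction of matching character positions, normalized by the longer string."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref), 1)

# predictions/references would hold the decoded model outputs and the gold
# vocalized verses from the held-out test set
predictions = ["..."]
references = ["..."]

bleu = sacrebleu.corpus_bleu(predictions, [references]).score
acc = 100 * sum(char_accuracy(p, r) for p, r in zip(predictions, references)) / len(predictions)
print(f"BLEU {bleu:.2f} | character accuracy {acc:.2f}%")
```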
## Usage
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load model and tokenizer
model_name = "johnlockejrr/aramaic-diacritization-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Example input (consonantal Aramaic text)
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"
# Tokenize input
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True)
# Generate vocalized text
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
# Decode output
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {consonantal_text}")
print(f"Output: {vocalized_text}")
```
### Using the Pipeline
```python
from transformers import pipeline
diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model")
# Process text
consonantal_text = "讘专讗砖讬转 讘专讗 讗诇讛讬诐 讗转 讛砖诪讬诐 讜讗转 讛讗专抓"
vocalized_text = diacritizer(consonantal_text)[0]['generated_text']
print(vocalized_text)
```
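The pipeline also accepts a list of inputs, which is convenient for vocalizing a chapter verse by verse. Reusing the `diacritizer` pipeline created above (the `batch_size` and generation settings here are illustrative):
```python
verses = [
    "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗",
    # ... more consonantal verses
]
results = diacritizer(verses, batch_size=8, max_length=512, num_beams=4)
for verse, result in zip(verses, results):
    print(verse, "->", result["generated_text"])
```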
## Training Data
The model was trained on a custom Aramaic diacritization dataset with the following characteristics:
- **Source:** Consonantal Aramaic text (without vowel points)
- **Target:** Vocalized Aramaic text (with nikkud/vowel points)
- **Data format:** CSV with columns: consonantal, vocalized, book, chapter, verse
- **Data split:** 80% train, 10% validation, 10% test
- **Text cleaning:** Preserves nikkud in target text, removes punctuation from source
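A minimal sketch of the split described above (the CSV file name and the random seed are assumptions, not taken from the actual training script):
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Columns as described above: consonantal, vocalized, book, chapter, verse
df = pd.read_csv("aramaic_diacritization.csv")  # hypothetical file name

# 80% train, 10% validation, 10% test
train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)
```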
### Data Preprocessing
- **Input cleaning:** Removes punctuation and formatting while preserving letters
- **Target preservation:** Maintains all nikkud (vowel points) and diacritical marks
- **Length filtering:** Removes sequences shorter than 2 characters or longer than 1000 characters
- **Duplicate handling:** Removes exact duplicates to prevent data leakage
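A hedged sketch of this cleaning and filtering step (the exact character classes and regular expressions used by the training script are not published; the ranges below cover Hebrew-script letters and nikkud/cantillation marks):
```python
import re
import unicodedata

# Letters: U+05D0-U+05EA; nikkud and cantillation marks: U+0591-U+05C7
KEEP_SOURCE = re.compile(r"[^\u05D0-\u05EA\s]")               # letters and whitespace only
KEEP_TARGET = re.compile(r"[^\u05D0-\u05EA\u0591-\u05C7\s]")  # also keep vowel points

def clean_source(text: str) -> str:
    """Consonantal input: drop punctuation and formatting, keep letters."""
    return KEEP_SOURCE.sub("", unicodedata.normalize("NFC", text)).strip()

def clean_target(text: str) -> str:
    """Vocalized target: drop punctuation but preserve all nikkud."""
    return KEEP_TARGET.sub("", unicodedata.normalize("NFC", text)).strip()

def keep_pair(src: str, tgt: str) -> bool:
    """Length filter: keep sequences between 2 and 1000 characters."""
    return 2 <= len(src) <= 1000 and 2 <= len(tgt) <= 1000
```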
## Limitations and Bias
- **Domain specificity:** Trained primarily on religious/biblical Aramaic texts
- **Vocabulary coverage:** Limited to the vocabulary present in the training corpus
- **Length constraints:** Maximum input/output length of 512 tokens
- **Dialect coverage:** May not handle modern Aramaic dialects or contemporary usage
- **Performance:** Character accuracy of ~64% indicates room for improvement
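For inputs longer than the 512-token limit, one practical workaround is to split the text and diacritize it verse by verse; the helper below is an illustrative sketch, not part of the released code:
```python
def diacritize_long(text, tokenizer, model, max_tokens=512):
    """Split on newlines and generate piece by piece to stay under the token limit."""
    outputs = []
    for piece in text.split("\n"):
        inputs = tokenizer(piece, return_tensors="pt", max_length=max_tokens, truncation=True)
        ids = model.generate(**inputs, max_length=max_tokens, num_beams=4, early_stopping=True)
        outputs.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return "\n".join(outputs)
```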
## Environmental Impact
- **Hardware used:** NVIDIA RTX 3060 (12 GB)
- **Training time:** ~6 hours
- **Carbon emissions:** Estimated low (single GPU, moderate training time)
- **Energy efficiency:** FP16 mixed precision reduces memory usage and speeds up training
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{aramaic-diacritization-2025,
  title={Aramaic Diacritization Model},
  author={John Locke Jr.},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/johnlockejrr/aramaic-diacritization-model}
}
```
## License
This model is released under the [MIT License](https://opensource.org/licenses/MIT).
## Acknowledgments
- Base model: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- Training framework: Hugging Face Transformers
- Dataset: Custom Aramaic diacritization corpus
## Model Files
- `model.safetensors` - Model weights (234MB)
- `config.json` - Model configuration
- `tokenizer_config.json` - Tokenizer configuration
- `source.spm` / `target.spm` - SentencePiece models
- `vocab.json` - Vocabulary file
- `generation_config.json` - Generation parameters
## Training Scripts
The model was trained using the following custom scripts:
- `train_arc2arc_improved_deep.py` - Main training script
- `run_arc2arc_improved_deep.sh` - Training execution script
- `run_resume_arc2arc_deep.sh` - Resume training script
## Contact
For questions, issues, or contributions, please open an issue on the model repository.