|
--- |
|
language: |
|
- arc |
|
tags: |
|
- diacritization |
|
- aramaic |
|
- vocalization |
|
- targum |
|
- semitic-languages |
|
- sequence-to-sequence |
|
license: mit |
|
base_model: Helsinki-NLP/opus-mt-afa-afa |
|
library_name: transformers |
|
--- |
|
|
|
# Aramaic Diacritization Model (MarianMT) |
|
|
|
This is a MarianMT model fine-tuned for Aramaic text diacritization (vocalization): it converts consonantal Aramaic text into fully vocalized text with nikkud (vowel points).
|
|
|
## Model Description |
|
|
|
- **Model type:** MarianMT (Encoder-Decoder Transformer) |
|
- **Language:** Aramaic (`arc` → `arc`, monolingual sequence-to-sequence)
|
- **Task:** Text diacritization/vocalization |
|
- **Base model:** [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa) |
|
- **Parameters:** 61,924,352 (61.9M) |
|
|
|
## Model Architecture |
|
|
|
- **Architecture:** MarianMT (Marian Machine Translation) |
|
- **Encoder layers:** 6 |
|
- **Decoder layers:** 6 |
|
- **Hidden size:** 512 |
|
- **Attention heads:** 8 |
|
- **Feed-forward dimension:** 2048 |
|
- **Vocabulary size:** 33,714 |
|
- **Max sequence length:** 512 tokens |
|
- **Activation function:** Swish |
|
- **Position embeddings:** Static |
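These hyperparameters can be read directly from the published `config.json`, for example:

```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters listed above
config = AutoConfig.from_pretrained("johnlockejrr/aramaic-diacritization-model")

print(config.encoder_layers, config.decoder_layers)  # 6 6
print(config.d_model)                                # 512
print(config.encoder_attention_heads)                # 8
print(config.encoder_ffn_dim)                        # 2048
print(config.vocab_size)                             # 33714
print(config.max_position_embeddings)                # 512
print(config.activation_function)                    # swish
```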
|
|
|
## Training Details |
|
|
|
### Training Configuration |
|
- **Training data:** 12,110 examples |
|
- **Validation data:** 1,514 examples |
|
- **Batch size:** 8 |
|
- **Gradient accumulation steps:** 2 |
|
- **Effective batch size:** 16 |
|
- **Learning rate:** 1e-5 |
|
- **Warmup steps:** 1,000 |
|
- **Max epochs:** 100 |
|
- **Training completed at:** Epoch 36.33 |
|
- **Mixed precision:** FP16 enabled |
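The original training script is listed under Training Scripts below but not reproduced here. As a rough sketch, the configuration above maps onto `Seq2SeqTrainingArguments` along these lines (the `output_dir` and evaluation settings are illustrative assumptions):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arc2arc-diacritization",  # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,        # effective batch size of 16
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=100,                 # training stopped at epoch 36.33
    fp16=True,
    predict_with_generate=True,           # assumption: generate during eval
)
```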
|
|
|
### Training Metrics |
|
- **Final training loss:** 0.283 |
|
- **Training runtime:** 21,727 seconds (~6 hours) |
|
- **Training samples per second:** 55.7 |
|
- **Training steps per second:** 3.48 |
|
|
|
## Evaluation Results |
|
|
|
### Test Set Performance |
|
- **BLEU Score:** 72.90 |
|
- **Character Accuracy:** 63.78% |
|
- **Evaluation Loss:** 0.088 |
|
- **Evaluation Runtime:** 311.5 seconds |
|
- **Evaluation samples per second:** 4.86 |
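The exact metric implementations are not published with the model; the sketch below shows one plausible way to compute BLEU (via `sacrebleu`) and a position-wise character accuracy, which may differ in detail from the definitions used in the original evaluation:

```python
import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    # Position-wise character matches over the reference length;
    # an alignment-based definition would score somewhat differently.
    return sum(p == r for p, r in zip(pred, ref)) / max(len(ref), 1)

def evaluate(predictions: list[str], references: list[str]) -> None:
    # predictions: model outputs; references: gold vocalized texts
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    acc = sum(char_accuracy(p, r) for p, r in zip(predictions, references)) / len(predictions)
    print(f"BLEU: {bleu.score:.2f}  character accuracy: {acc:.2%}")
```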
|
|
|
## Usage |
|
|
|
### Basic Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
# Load model and tokenizer |
|
model_name = "johnlockejrr/aramaic-diacritization-model" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) |
|
|
|
# Example input: consonantal Aramaic text (Targum Onkelos, Genesis 1:1)
|
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗" |
|
|
|
# Tokenize input |
|
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True) |
|
|
|
# Generate vocalized text |
|
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True) |
|
|
|
# Decode output |
|
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(f"Input: {consonantal_text}") |
|
print(f"Output: {vocalized_text}") |
|
``` |
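Note that `num_beams=4` with `early_stopping=True` trades decoding speed for output quality; `generate` defaults to greedy decoding (`num_beams=1`), which is noticeably faster when throughput matters more than the last few BLEU points.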
|
|
|
### Using the Pipeline |
|
|
|
```python |
|
from transformers import pipeline |
|
|
|
diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model") |
|
|
|
# Process consonantal Aramaic text (Targum Onkelos, Genesis 1:1)

consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"
|
vocalized_text = diacritizer(consonantal_text)[0]['generated_text'] |
|
print(vocalized_text) |
|
``` |
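The pipeline also accepts a list of strings, so a whole corpus can be diacritized in batches, e.g. `diacritizer(verses, batch_size=8)`.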
|
|
|
## Training Data |
|
|
|
The model was trained on a custom Aramaic diacritization dataset with the following characteristics: |
|
|
|
- **Source:** Consonantal Aramaic text (without vowel points) |
|
- **Target:** Vocalized Aramaic text (with nikkud/vowel points) |
|
- **Data format:** CSV with columns `consonantal`, `vocalized`, `book`, `chapter`, `verse`
|
- **Data split:** 80% train, 10% validation, 10% test |
|
- **Text cleaning:** Preserves nikkud in target text, removes punctuation from source |
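The corpus itself is not distributed with the model. Assuming a CSV in the format above, the 80/10/10 split could be reproduced along these lines (the file name and random seed are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; columns: consonantal, vocalized, book, chapter, verse
df = pd.read_csv("aramaic_diacritization.csv")

# 80% train, then split the remaining 20% evenly into validation and test
train_df, rest_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)
```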
|
|
|
### Data Preprocessing |
|
- **Input cleaning:** Removes punctuation and formatting while preserving letters |
|
- **Target preservation:** Maintains all nikkud (vowel points) and diacritical marks |
|
- **Length filtering:** Removes sequences shorter than 2 characters or longer than 1000 characters |
|
- **Duplicate handling:** Removes exact duplicates to prevent data leakage |
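A minimal sketch of this kind of cleaning for Hebrew-script text is shown below. It assumes Hebrew letters live in U+05D0–U+05EA and nikkud/cantillation marks in U+0591–U+05C7 (a range that also contains a few Hebrew punctuation marks, so the original pipeline's rules may differ in detail):

```python
import re

def clean_source(text: str) -> str:
    # Keep only Hebrew letters and whitespace: no punctuation, no nikkud
    return re.sub(r"[^\u05D0-\u05EA\s]", "", text).strip()

def clean_target(text: str) -> str:
    # Keep letters plus nikkud/cantillation (U+0591-U+05C7); note this range
    # also includes maqaf and sof pasuq, which one may want to strip as well
    return re.sub(r"[^\u05D0-\u05EA\u0591-\u05C7\s]", "", text).strip()

def keep_example(src: str, tgt: str) -> bool:
    # Length filtering as described above: 2 to 1000 characters
    return 2 <= len(src) <= 1000 and 2 <= len(tgt) <= 1000
```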
|
|
|
## Limitations and Bias |
|
|
|
- **Domain specificity:** Trained primarily on religious/biblical Aramaic texts |
|
- **Vocabulary coverage:** Limited to the vocabulary present in the training corpus |
|
- **Length constraints:** Maximum input/output length of 512 tokens (for longer texts, see the chunking sketch after this list)
|
- **Dialect coverage:** May not handle modern Aramaic dialects or contemporary usage
|
- **Performance:** Character accuracy of ~64% indicates room for improvement |
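One workaround for the 512-token limit is to split long inputs on word boundaries before generation and diacritize the chunks independently. A rough sketch follows; the 400-token budget is a conservative assumption that leaves headroom for special tokens:

```python
def chunk_text(text: str, tokenizer, max_tokens: int = 400) -> list[str]:
    # Greedily pack whole words into chunks that stay under the token budget
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(tokenizer(candidate).input_ids) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```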
|
|
|
## Environmental Impact |
|
|
|
- **Hardware used:** NVIDIA RTX 3060 (12 GB)
|
- **Training time:** ~6 hours |
|
- **Carbon emissions:** Estimated low (single GPU, moderate training time) |
|
- **Energy efficiency:** FP16 mixed precision used to reduce memory usage |
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex |
|
@misc{aramaic-diacritization-2025,
|
title={Aramaic Diacritization Model}, |
|
author={John Locke Jr.}, |
|
year={2025}, |
|
howpublished={Hugging Face Model Hub}, |
|
url={https://huggingface.co/johnlockejrr/aramaic-diacritization-model} |
|
} |
|
``` |
|
|
|
## License |
|
|
|
This model is released under the [MIT License](https://opensource.org/licenses/MIT).
|
|
|
## Acknowledgments |
|
|
|
- Base model: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa) |
|
- Training framework: Hugging Face Transformers |
|
- Dataset: Custom Aramaic diacritization corpus |
|
|
|
## Model Files |
|
|
|
- `model.safetensors` - Model weights (234MB) |
|
- `config.json` - Model configuration |
|
- `tokenizer_config.json` - Tokenizer configuration |
|
- `source.spm` / `target.spm` - SentencePiece models |
|
- `vocab.json` - Vocabulary file |
|
- `generation_config.json` - Generation parameters |
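All of these files can be fetched in one go with `huggingface_hub`, for example:

```python
from huggingface_hub import snapshot_download

# Downloads every file listed above and returns the local cache path
local_dir = snapshot_download("johnlockejrr/aramaic-diacritization-model")
```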
|
|
|
## Training Scripts |
|
|
|
The model was trained using custom scripts: |
|
- `train_arc2arc_improved_deep.py` - Main training script |
|
- `run_arc2arc_improved_deep.sh` - Training execution script |
|
- `run_resume_arc2arc_deep.sh` - Resume training script |
|
|
|
## Contact |
|
|
|
For questions, issues, or contributions, please open an issue on the model repository. |
|
|