---
language:
- arc
tags:
- diacritization
- aramaic
- vocalization
- targum
- semitic-languages
- sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---
# Aramaic Diacritization Model (MarianMT)
This is a MarianMT model fine-tuned for Aramaic text diacritization (vocalization): it converts consonantal Aramaic text into fully vocalized text with nikkud (vowel points).
## Model Description
- **Model type:** MarianMT (Encoder-Decoder Transformer)
- **Language:** Aramaic (`arc` → `arc`)
- **Task:** Text diacritization/vocalization
- **Base model:** [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- **Parameters:** 61,924,352 (61.9M)
## Model Architecture
- **Architecture:** MarianMT (Marian Machine Translation)
- **Encoder layers:** 6
- **Decoder layers:** 6
- **Hidden size:** 512
- **Attention heads:** 8
- **Feed-forward dimension:** 2048
- **Vocabulary size:** 33,714
- **Max sequence length:** 512 tokens
- **Activation function:** Swish
- **Position embeddings:** Static
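
These figures can be checked against the published checkpoint. The sketch below assumes the standard MarianConfig field names and simply reads them back, then verifies the reported parameter count:
```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/aramaic-diacritization-model"

# Read the architecture hyperparameters back from the published config
config = AutoConfig.from_pretrained(model_name)
print(config.encoder_layers, config.decoder_layers)    # 6, 6
print(config.d_model, config.encoder_attention_heads)  # 512, 8
print(config.encoder_ffn_dim, config.vocab_size)       # 2048, 33714
print(config.activation_function, config.max_position_embeddings)

# Verify the reported parameter count (~61.9M)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
print(sum(p.numel() for p in model.parameters()))
```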
## Training Details
### Training Configuration
- **Training data:** 12,110 examples
- **Validation data:** 1,514 examples
- **Batch size:** 8
- **Gradient accumulation steps:** 2
- **Effective batch size:** 16
- **Learning rate:** 1e-5
- **Warmup steps:** 1,000
- **Max epochs:** 100
- **Training completed at:** Epoch 36.33
- **Mixed precision:** FP16 enabled
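
The actual training code is listed under "Training Scripts" below and is not reproduced here. As a rough illustration only, a `Seq2SeqTrainingArguments` setup matching the hyperparameters above might look like this (the output directory and anything not in the list above are assumptions):
```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the configuration listed above -- not the original script
training_args = Seq2SeqTrainingArguments(
    output_dir="arc2arc-diacritization",   # assumed output path
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,          # assumed to match the train batch size
    gradient_accumulation_steps=2,         # effective batch size 8 * 2 = 16
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=100,                  # training actually stopped around epoch 36
    fp16=True,                             # mixed precision
    predict_with_generate=True,            # assumed; needed to decode outputs for BLEU
)
```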
### Training Metrics
- **Final training loss:** 0.283
- **Training runtime:** 21,727 seconds (~6 hours)
- **Training samples per second:** 55.7
- **Training steps per second:** 3.48
## Evaluation Results
### Test Set Performance
- **BLEU Score:** 72.90
- **Character Accuracy:** 63.78%
- **Evaluation Loss:** 0.088
- **Evaluation Runtime:** 311.5 seconds
- **Evaluation samples per second:** 4.86
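
The card does not spell out how character accuracy was computed. The sketch below shows one plausible way to score decoded outputs with sacrebleu (a separate dependency) plus a naive position-wise character accuracy; the placeholder strings and the accuracy definition are assumptions, not the original evaluation code:
```python
import sacrebleu  # pip install sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    """Naive position-wise character accuracy (one possible definition)."""
    if not pred and not ref:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

predictions = ["decoded model output goes here"]   # placeholder
references  = ["gold vocalized target goes here"]  # placeholder

bleu = sacrebleu.corpus_bleu(predictions, [references])
acc = sum(char_accuracy(p, r) for p, r in zip(predictions, references)) / len(predictions)
print(f"BLEU: {bleu.score:.2f}  Character accuracy: {acc:.2%}")
```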
## Usage
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/aramaic-diacritization-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example input (consonantal Aramaic text)
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"

# Tokenize input
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True)

# Generate vocalized text
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)

# Decode output
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {consonantal_text}")
print(f"Output: {vocalized_text}")
```
### Using the Pipeline
```python
from transformers import pipeline
diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model")

# Process text (consonantal Aramaic, same example as above)
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"
vocalized_text = diacritizer(consonantal_text)[0]['generated_text']
print(vocalized_text)
```
## Training Data
The model was trained on a custom Aramaic diacritization dataset with the following characteristics:
- **Source:** Consonantal Aramaic text (without vowel points)
- **Target:** Vocalized Aramaic text (with nikkud/vowel points)
- **Data format:** CSV with columns: consonantal, vocalized, book, chapter, verse
- **Data split:** 80% train, 10% validation, 10% test
- **Text cleaning:** Preserves nikkud in target text, removes punctuation from source
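
A minimal sketch of loading such a CSV and reproducing the 80/10/10 split with 🤗 Datasets is shown below; the file name and random seed are assumptions, since the corpus itself is not distributed with the model:
```python
from datasets import load_dataset

# Hypothetical file name -- the actual corpus is not published with this model
ds = load_dataset("csv", data_files="aramaic_diacritization.csv")["train"]

# 80% train, then split the remaining 20% evenly into validation and test
split = ds.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], heldout["train"], heldout["test"]

print(len(train_ds), len(val_ds), len(test_ds))
print(train_ds[0]["consonantal"], "->", train_ds[0]["vocalized"])
```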
### Data Preprocessing
- **Input cleaning:** Removes punctuation and formatting while preserving letters
- **Target preservation:** Maintains all nikkud (vowel points) and diacritical marks
- **Length filtering:** Removes sequences shorter than 2 characters or longer than 1000 characters
- **Duplicate handling:** Removes exact duplicates to prevent data leakage
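
A hedged sketch of this preprocessing is shown below; the Unicode ranges and helper names are assumptions chosen to match the description above, not the original code. Exact duplicates can then be dropped by keeping (source, target) pairs in a set.
```python
import re

# Hebrew/Aramaic vowel points and cantillation marks occupy U+0591..U+05C7
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")
# Keep Hebrew-script letters (U+05D0..U+05EA) and whitespace, drop punctuation
NON_LETTER_RE = re.compile(r"[^\u05D0-\u05EA\s]")

def clean_source(text: str) -> str:
    """Consonantal input: strip nikkud and punctuation, keep only letters."""
    text = NIKKUD_RE.sub("", text)
    return re.sub(r"\s+", " ", NON_LETTER_RE.sub("", text)).strip()

def keep_example(source: str, target: str) -> bool:
    """Length filter described above: 2 to 1000 characters on both sides."""
    return 2 <= len(source) <= 1000 and 2 <= len(target) <= 1000

print(clean_source("讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗."))
```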
## Limitations and Bias
- **Domain specificity:** Trained primarily on religious/biblical Aramaic texts
- **Vocabulary coverage:** Limited to the vocabulary present in the training corpus
- **Length constraints:** Maximum input/output length of 512 tokens
- **Style consistency:** May not handle modern Aramaic dialects or contemporary usage
- **Performance:** Character accuracy of ~64% indicates room for improvement
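
Texts longer than the 512-token limit have to be split before diacritization. A simple word-level chunking workaround (an assumption, not part of the released pipeline) could look like this, reusing the tokenizer and model from the usage example above:
```python
def diacritize_long(text, tokenizer, model, max_tokens=512):
    """Split long input into chunks that fit the 512-token limit, then diacritize each."""
    words, chunks, current = text.split(), [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(tokenizer(candidate)["input_ids"]) >= max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))

    pieces = []
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=max_tokens)
        ids = model.generate(**inputs, max_length=max_tokens, num_beams=4, early_stopping=True)
        pieces.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return " ".join(pieces)
```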
## Environmental Impact
- **Hardware used:** NVIDIA RTX 3060 (12 GB)
- **Training time:** ~6 hours
- **Carbon emissions:** Estimated low (single GPU, moderate training time)
- **Energy efficiency:** FP16 mixed precision used to reduce memory usage
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{aramaic-diacritization-2024,
  title        = {Aramaic Diacritization Model},
  author       = {Your Name},
  year         = {2024},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/johnlockejrr/aramaic-diacritization-model}
}
```
## License
This model is released under the MIT License, as declared in the model card metadata (`license: mit`).
## Acknowledgments
- Base model: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- Training framework: Hugging Face Transformers
- Dataset: Custom Aramaic diacritization corpus
## Model Files
- `model.safetensors` - Model weights (234MB)
- `config.json` - Model configuration
- `tokenizer_config.json` - Tokenizer configuration
- `source.spm` / `target.spm` - SentencePiece models
- `vocab.json` - Vocabulary file
- `generation_config.json` - Generation parameters
## Training Scripts
The model was trained using custom scripts:
- `train_arc2arc_improved_deep.py` - Main training script
- `run_arc2arc_improved_deep.sh` - Training execution script
- `run_resume_arc2arc_deep.sh` - Resume training script
## Contact
For questions, issues, or contributions, please open an issue on the model repository.