---
language:
- arc
tags:
- diacritization
- aramaic
- vocalization
- targum
- semitic-languages
- sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---

# Aramaic Diacritization Model (MarianMT)

This model is a fine-tuned MarianMT model for Aramaic text diacritization (vocalization), converting consonantal Aramaic text to fully vocalized text with nikkud (vowel points).

## Model Description

- **Model type:** MarianMT (Encoder-Decoder Transformer)
- **Language:** Aramaic (`arc`), monolingual (arc → arc)
- **Task:** Text diacritization/vocalization
- **Base model:** [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- **Parameters:** 61,924,352 (61.9M)

## Model Architecture

- **Architecture:** MarianMT (Marian Machine Translation)
- **Encoder layers:** 6
- **Decoder layers:** 6
- **Hidden size:** 512
- **Attention heads:** 8
- **Feed-forward dimension:** 2048
- **Vocabulary size:** 33,714
- **Max sequence length:** 512 tokens
- **Activation function:** Swish
- **Position embeddings:** Static
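
These values can be checked directly against the published configuration. A minimal sketch, assuming the model id used in the Usage section below:

```python
from transformers import AutoConfig

# Load the configuration from the Hub and print the key architecture fields
config = AutoConfig.from_pretrained("johnlockejrr/aramaic-diacritization-model")

print(config.encoder_layers, config.decoder_layers)   # 6, 6
print(config.d_model)                                 # 512 (hidden size)
print(config.encoder_attention_heads)                 # 8
print(config.encoder_ffn_dim)                         # 2048 (feed-forward dimension)
print(config.vocab_size)                              # 33,714
print(config.max_position_embeddings)                 # 512
print(config.activation_function)                     # swish
```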

## Training Details

### Training Configuration
- **Training data:** 12,110 examples
- **Validation data:** 1,514 examples
- **Batch size:** 8
- **Gradient accumulation steps:** 2
- **Effective batch size:** 16
- **Learning rate:** 1e-5
- **Warmup steps:** 1,000
- **Max epochs:** 100
- **Training completed at:** Epoch 36.33
- **Mixed precision:** FP16 enabled
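
For reference, a minimal `Seq2SeqTrainingArguments` sketch that mirrors the values above (an illustration using the standard Transformers API, not the exact arguments of the training script listed under "Training Scripts"):

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters mirroring the configuration listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="aramaic-diacritization-model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=100,            # training finished early, at epoch ~36
    fp16=True,                       # mixed precision
    predict_with_generate=True,
)
```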

### Training Metrics
- **Final training loss:** 0.283
- **Training runtime:** 21,727 seconds (~6 hours)
- **Training samples per second:** 55.7
- **Training steps per second:** 3.48

## Evaluation Results

### Test Set Performance
- **BLEU Score:** 72.90
- **Character Accuracy:** 63.78%
- **Evaluation Loss:** 0.088
- **Evaluation Runtime:** 311.5 seconds
- **Evaluation samples per second:** 4.86
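
Character accuracy is a position-wise comparison of the generated and reference strings. A minimal sketch of one way to compute both metrics (the exact definitions behind the numbers above may differ; `sacrebleu` is assumed to be installed):

```python
import sacrebleu  # pip install sacrebleu

def char_accuracy(prediction: str, reference: str) -> float:
    """Fraction of aligned character positions that match (one possible definition)."""
    if not prediction and not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / max(len(prediction), len(reference))

# predictions: generated, vocalized test sentences; references: one reference stream
predictions = ["..."]    # placeholder for model outputs
references = [["..."]]   # placeholder; outer list = reference streams
bleu = sacrebleu.corpus_bleu(predictions, references)
print(bleu.score, char_accuracy(predictions[0], references[0][0]))
```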

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/aramaic-diacritization-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example input (consonantal Aramaic text)
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"

# Tokenize input
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True)

# Generate vocalized text
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)

# Decode output
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {consonantal_text}")
print(f"Output: {vocalized_text}")
```

### Using the Pipeline

```python
from transformers import pipeline

diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model")

# Process text
consonantal_text = "讘专讗砖讬转 讘专讗 讗诇讛讬诐 讗转 讛砖诪讬诐 讜讗转 讛讗专抓"
vocalized_text = diacritizer(consonantal_text)[0]['generated_text']
print(vocalized_text)
```

## Training Data

The model was trained on a custom Aramaic diacritization dataset with the following characteristics:

- **Source:** Consonantal Aramaic text (without vowel points)
- **Target:** Vocalized Aramaic text (with nikkud/vowel points)
- **Data format:** CSV with columns: consonantal, vocalized, book, chapter, verse
- **Data split:** 80% train, 10% validation, 10% test
- **Text cleaning:** Preserves nikkud in target text, removes punctuation from source

### Data Preprocessing
- **Input cleaning:** Removes punctuation and formatting while preserving letters
- **Target preservation:** Maintains all nikkud (vowel points) and diacritical marks
- **Length filtering:** Removes sequences shorter than 2 characters or longer than 1000 characters
- **Duplicate handling:** Removes exact duplicates to prevent data leakage
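
The consonantal/vocalized relationship can be illustrated by stripping the vowel points from a vocalized string. A rough sketch using the Hebrew-script mark ranges (Targum Aramaic is written in Hebrew script; the actual preprocessing in the training script may differ):

```python
import re

# Hebrew-script cantillation marks (U+0591–U+05AF) and vowel points/nikkud
# (U+05B0–U+05C7) are combining characters attached to the consonants.
MARKS = re.compile(r"[\u0591-\u05C7]")

def strip_nikkud(vocalized: str) -> str:
    """Remove nikkud and cantillation, leaving the consonantal skeleton."""
    return MARKS.sub("", vocalized)

print(strip_nikkud("讘旨职拽址讚职诪执讬谉"))  # 讘拽讚诪讬谉 (consonants only)
```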

## Limitations and Bias

- **Domain specificity:** Trained primarily on religious/biblical Aramaic texts
- **Vocabulary coverage:** Limited to the vocabulary present in the training corpus
- **Length constraints:** Maximum input/output length of 512 tokens
- **Dialect coverage:** May not handle modern Aramaic dialects or contemporary usage
- **Performance:** Character accuracy of ~64% indicates room for improvement

## Environmental Impact

- **Hardware used:** NVIDIA RTX 3060 (12 GB)
- **Training time:** ~6 hours
- **Carbon emissions:** Estimated low (single GPU, moderate training time)
- **Energy efficiency:** FP16 mixed precision used to reduce memory usage

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{aramaic-diacritization-2024,
  title={Aramaic Diacritization Model},
  author={Your Name},
  year={2024},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/johnlockejrr/aramaic-diacritization-model}
}
```

## License

This model is released under the MIT License (see the `license` field in the model card metadata).

## Acknowledgments

- Base model: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- Training framework: Hugging Face Transformers
- Dataset: Custom Aramaic diacritization corpus

## Model Files

- `model.safetensors` - Model weights (234MB)
- `config.json` - Model configuration
- `tokenizer_config.json` - Tokenizer configuration
- `source.spm` / `target.spm` - SentencePiece models
- `vocab.json` - Vocabulary file
- `generation_config.json` - Generation parameters

## Training Scripts

The model was trained using custom scripts:
- `train_arc2arc_improved_deep.py` - Main training script
- `run_arc2arc_improved_deep.sh` - Training execution script
- `run_resume_arc2arc_deep.sh` - Resume training script

## Contact

For questions, issues, or contributions, please open an issue on the model repository.