---
language:
- arc
tags:
- diacritization
- aramaic
- vocalization
- targum
- semitic-languages
- sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---

# Aramaic Diacritization Model (MarianMT)

This model is a fine-tuned MarianMT model for Aramaic text diacritization (vocalization): it converts consonantal Aramaic text into fully vocalized text with nikkud (vowel points).

## Model Description

- **Model type:** MarianMT (encoder-decoder Transformer)
- **Language:** Aramaic (arc2arc)
- **Task:** Text diacritization/vocalization
- **Base model:** [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- **Parameters:** 61,924,352 (61.9M)

## Model Architecture

- **Architecture:** MarianMT (Marian Machine Translation)
- **Encoder layers:** 6
- **Decoder layers:** 6
- **Hidden size:** 512
- **Attention heads:** 8
- **Feed-forward dimension:** 2048
- **Vocabulary size:** 33,714
- **Max sequence length:** 512 tokens
- **Activation function:** Swish
- **Position embeddings:** Static

## Training Details

### Training Configuration

- **Training data:** 12,110 examples
- **Validation data:** 1,514 examples
- **Batch size:** 8
- **Gradient accumulation steps:** 2
- **Effective batch size:** 16
- **Learning rate:** 1e-5
- **Warmup steps:** 1,000
- **Max epochs:** 100
- **Training completed at:** epoch 36.33
- **Mixed precision:** FP16 enabled

### Training Metrics

- **Final training loss:** 0.283
- **Training runtime:** 21,727 seconds (~6 hours)
- **Training samples per second:** 55.7
- **Training steps per second:** 3.48

## Evaluation Results

### Test Set Performance

- **BLEU score:** 72.90
- **Character accuracy:** 63.78%
- **Evaluation loss:** 0.088
- **Evaluation runtime:** 311.5 seconds
- **Evaluation samples per second:** 4.86

(A sketch of one plausible character-accuracy computation appears at the end of this card.)

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/aramaic-diacritization-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example input (consonantal Aramaic text)
consonantal_text = "בקדמין ברא יי ית שמיא וית ארעא"

# Tokenize input
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True)

# Generate vocalized text
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)

# Decode output
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Input: {consonantal_text}")
print(f"Output: {vocalized_text}")
```

### Using the Pipeline

```python
from transformers import pipeline

diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model")

# Process consonantal Aramaic text (Targum Onkelos, Genesis 1:1)
consonantal_text = "בקדמין ברא יי ית שמיא וית ארעא"
vocalized_text = diacritizer(consonantal_text)[0]['generated_text']
print(vocalized_text)
```

## Training Data

The model was trained on a custom Aramaic diacritization dataset with the following characteristics:

- **Source:** Consonantal Aramaic text (without vowel points)
- **Target:** Vocalized Aramaic text (with nikkud/vowel points)
- **Data format:** CSV with columns `consonantal`, `vocalized`, `book`, `chapter`, `verse`
- **Data split:** 80% train, 10% validation, 10% test
- **Text cleaning:** Preserves nikkud in the target text, removes punctuation from the source

### Data Preprocessing

- **Input cleaning:** Removes punctuation and formatting while preserving letters
- **Target preservation:** Maintains all nikkud (vowel points) and diacritical marks
- **Length filtering:** Removes sequences shorter than 2 characters or longer than 1,000 characters
- **Duplicate handling:** Removes exact duplicates to prevent data leakage
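The preprocessing code itself is not published in this repository; the sketch below illustrates the steps above under stated assumptions: a pandas DataFrame with the CSV columns listed earlier, a hypothetical file name `targum.csv`, and approximate Unicode ranges for nikkud and for the letters.

```python
import re
import unicodedata

import pandas as pd

# Hebrew-script combining marks (nikkud and cantillation) occupy U+0591-U+05C7;
# base letters occupy U+05D0-U+05EA. Both ranges are approximations of what the
# original cleaning code may have used.
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")

def clean_source(text: str) -> str:
    """Strip vowel points, punctuation, and extra whitespace from the source side."""
    text = NIKKUD_RE.sub("", text)                  # drop any stray vowel points
    text = re.sub(r"[^\u05D0-\u05EA\s]", "", text)  # keep only letters and spaces
    return re.sub(r"\s+", " ", text).strip()

def clean_target(text: str) -> str:
    """Normalize the target side while preserving nikkud and other diacritics."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text)).strip()

df = pd.read_csv("targum.csv")  # hypothetical file with the columns listed above
df["consonantal"] = df["consonantal"].map(clean_source)
df["vocalized"] = df["vocalized"].map(clean_target)

# Length filtering: keep sequences of 2-1000 characters.
df = df[df["consonantal"].str.len().between(2, 1000)]

# Duplicate handling: drop exact duplicates to prevent train/test leakage.
df = df.drop_duplicates(subset=["consonantal", "vocalized"]).reset_index(drop=True)
```

The 80/10/10 train/validation/test split described above would be applied after this cleaning step.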
## Limitations and Bias

- **Domain specificity:** Trained primarily on religious/biblical Aramaic texts
- **Vocabulary coverage:** Limited to the vocabulary present in the training corpus
- **Length constraints:** Maximum input/output length of 512 tokens
- **Style consistency:** May not handle modern Aramaic dialects or contemporary usage
- **Performance:** Character accuracy of ~64% indicates room for improvement

## Environmental Impact

- **Hardware used:** NVIDIA RTX 3060 (12 GB)
- **Training time:** ~6 hours
- **Carbon emissions:** Estimated low (single GPU, moderate training time)
- **Energy efficiency:** FP16 mixed precision used to reduce memory usage

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{aramaic-diacritization-2024,
  title={Aramaic Diacritization Model},
  author={johnlockejrr},
  year={2024},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/johnlockejrr/aramaic-diacritization-model}
}
```

## License

This model is released under the MIT License, as declared in the model card metadata above.

## Acknowledgments

- Base model: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- Training framework: Hugging Face Transformers
- Dataset: Custom Aramaic diacritization corpus

## Model Files

- `model.safetensors` - Model weights (234 MB)
- `config.json` - Model configuration
- `tokenizer_config.json` - Tokenizer configuration
- `source.spm` / `target.spm` - SentencePiece models
- `vocab.json` - Vocabulary file
- `generation_config.json` - Generation parameters

## Training Scripts

The model was trained using custom scripts:

- `train_arc2arc_improved_deep.py` - Main training script
- `run_arc2arc_improved_deep.sh` - Training execution script
- `run_resume_arc2arc_deep.sh` - Resume-training script

## Contact

For questions, issues, or contributions, please open an issue on the model repository.
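## Appendix: Character-Accuracy Sketch

The character-accuracy figure under Evaluation Results is not a standard `transformers` metric, and the evaluation code is not published here. The sketch below shows one plausible position-wise definition; padding length mismatches as errors is an assumption, and the actual evaluation may have used a different (e.g. edit-distance-based) variant.

```python
from itertools import zip_longest

def char_accuracy(prediction: str, reference: str) -> float:
    """Fraction of character positions at which prediction and reference agree.

    The shorter string is padded so that length mismatches count as errors;
    the evaluation that produced the reported 63.78% may define this differently.
    """
    pairs = list(zip_longest(prediction, reference, fillvalue="\0"))
    if not pairs:
        return 0.0
    return sum(p == r for p, r in pairs) / len(pairs)

# Example: two strings differing in their last character out of three.
print(char_accuracy("abc", "abd"))  # 0.666...
```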