Model Card for auto-edit-hy-500m
Model Details
Model Description
auto-edit-hy-500m is a specialized model for auto-editing Armenian (hy) text, using the AutoEditForConditionalGeneration architecture from the hy-models library. With approximately 500M parameters, it processes Armenian text to correct errors, ensuring clarity and accuracy across various contexts. The model was trained on a synthetic dataset created from the SAW-corpus, incorporating diverse error patterns to support robust auto-editing capabilities for Armenian NLP.
It supports text with Markdown formatting, including lists and tables.
- Developed by: MMinasyan (https://github.com/MMinasyan)
- Model type: text-to-text
- Language(s) (NLP): Armenian (hy)
- License: Apache-2.0
Model Sources
- Repository: https://huggingface.co/Syntheresis/auto-edit-hy-500m (model weights); https://github.com/MMinasyan/hy-models (library)
Uses
Direct Use
The model is intended for direct use in auto-editing Armenian text, correcting errors to improve clarity, accuracy, and overall quality.
Out-of-Scope Use
- Editing text in languages other than Armenian.
- Tasks beyond auto-editing, such as general language modeling or translation.
Bias, Risks, and Limitations
- Limited to Armenian text, with no support for other languages.
- No evaluation metrics are available due to the lack of comparable Armenian auto-editing models.
Recommendations
Users should test the model on their specific use cases to ensure it meets their needs.
How to Get Started with the Model
# Install the hy-models library first:
# pip install git+https://github.com/MMinasyan/hy-models
import torch
from transformers import AutoTokenizer
from hy_models import AutoEditForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Syntheresis/auto-edit-hy-500m")
model = AutoEditForConditionalGeneration.from_pretrained("Syntheresis/auto-edit-hy-500m").to("cuda")

# Auto-edit a random aunt's YouTube comment
input_text = "շատ գրագետ խոսումեք: բայց փաստն այնե որ գողությունը ավելացելա հետեվաբար, ձեր խոսոլը զրոե պեքե աշխատել"
inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
# Շատ գրագետ խոսում եք, բայց փաստն այն է, որ գողությունը ավելացել է, հետեւաբար, ձեր խոսելը զրո է, պետք է աշխատել
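For several texts at once, batched generation works the same way. The following is a minimal sketch reusing the tokenizer and model objects above; it assumes the tokenizer supports the standard transformers padding/truncation arguments, and the max_new_tokens and num_beams settings are illustrative, not taken from this card.

# Batched auto-editing (sketch; generation settings are illustrative)
texts = [input_text, input_text]  # any list of Armenian strings
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to("cuda")
with torch.no_grad():
    generated = model.generate(**batch, max_new_tokens=256, num_beams=4)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))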
Training Details
Training Data
The model was trained on a synthetic dataset derived from the SAW-corpus (https://huggingface.co/datasets/Syntheresis/SAW-corpus). The synthetic dataset was created by:
- Regenerating masked sequences with token-level and character-level generative models.
- Applying back-translation with open-source machine-translation models for text variety.
- Introducing random grammatical errors drawn from over 500 grammatical mistake patterns (a toy sketch of this step follows this list).
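The actual error patterns are not published. As a minimal sketch of what rule-based error injection can look like, the hypothetical ERROR_PATTERNS list below uses mistakes visible in the example comment above (such as the merged auxiliary in "խոսումեք" for "խոսում եք"); the real pipeline uses over 500 such patterns.

import random

# Hypothetical patterns: (correct form, common mistake)
ERROR_PATTERNS = [
    ("խոսում եք", "խոսումեք"),  # merged verb + auxiliary
    ("այն է", "այնե"),          # fused copula
    ("պետք է", "պեքե"),         # colloquial contraction
]

def corrupt(text, p=0.5):
    # Apply each matching pattern with probability p.
    for correct, wrong in ERROR_PATTERNS:
        if correct in text and random.random() < p:
            text = text.replace(correct, wrong)
    return text

clean = "փաստն այն է, որ պետք է աշխատել"
pair = (corrupt(clean), clean)  # (noisy input, clean target) training example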
Training Procedure
Training Hyperparameters
- Training regime: Mixed precision (bf16); batch size increasing from 36,864 to 147,456; ~500,000 steps; warmup followed by cosine annealing at a peak learning rate of 0.0001 (see the schedule sketch below).
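The warmup length is not stated in the card, so the value below is an assumption; the rest of this sketch follows the description: linear warmup to a peak learning rate of 0.0001, then cosine annealing over the remaining ~500,000 steps.

import math

PEAK_LR = 1e-4         # peak learning rate from the card
TOTAL_STEPS = 500_000  # approximate step count from the card
WARMUP_STEPS = 10_000  # assumption; not stated in the card

def lr_at(step):
    # Linear warmup, then cosine annealing toward zero.
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))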
Technical Specifications
Model Architecture and Objective
Architecture: Text-to-text model with cross-attention, pre-norm, Rotary Position Embeddings (RoPE), and Grouped-Query Attention (GQA; a minimal sketch follows below).
Objective: Auto-editing Armenian text to correct errors and improve quality.
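As an illustration of the attention variant, here is a minimal PyTorch sketch of Grouped-Query Attention, in which each group of query heads shares one key/value head; the head counts and shapes are illustrative and not taken from this model.

import torch

def grouped_query_attention(q, k, v):
    # q: (batch, seq, n_q_heads, d); k, v: (batch, seq, n_kv_heads, d)
    group = q.shape[2] // k.shape[2]
    k = k.repeat_interleave(group, dim=2)  # share each KV head across its query group
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> (batch, heads, seq, d)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2)

# Illustrative shapes: 16 query heads sharing 4 key/value heads
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 4, 64)
v = torch.randn(1, 8, 4, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 16, 64)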
Compute Infrastructure
Hardware
Dual NVIDIA RTX 4090 GPUs
Software
- Hugging Face transformers library (version 4.49.0.dev0)
- hy-models library (https://github.com/MMinasyan/hy-models)