Model Card for auto-edit-hy-500m
Model Details
Model Description
auto-edit-hy-500m is a specialized model for auto-editing Armenian (hy) text, using the AutoEditForConditionalGeneration architecture from the hy-models library. With approximately 500M parameters, it processes Armenian text to correct errors, ensuring clarity and accuracy across various contexts. The model was trained on a synthetic dataset created from the SAW-corpus, incorporating diverse error patterns to support robust auto-editing capabilities for Armenian NLP.
It supports text with Markdown formatting, including lists and tables.
- Developed by: MMinasyan (https://github.com/MMinasyan)
- Model type: text-to-text
- Language(s) (NLP): Armenian (hy)
- License: Apache-2.0
Model Sources
- Repository: https://huggingface.co/Syntheresis/auto-edit-hy-500m (model weights); https://github.com/MMinasyan/hy-models (library)
Uses
Direct Use
The model is intended for direct use in auto-editing Armenian text, correcting errors to improve clarity, accuracy, and overall quality.
Out-of-Scope Use
- Editing text in languages other than Armenian.
- Tasks beyond auto-editing, such as general language modeling or translation.
Bias, Risks, and Limitations
- Limited to Armenian text, with no support for other languages.
- No evaluation metrics are available due to the lack of comparable Armenian auto-editing models.
Recommendations
Users should test the model on their specific use cases to ensure it meets their needs.
How to Get Started with the Model
# Install the hy-models library first:
# pip install git+https://github.com/MMinasyan/hy-models
import torch
from transformers import AutoTokenizer
from hy_models import AutoEditForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Syntheresis/auto-edit-hy-500m")
model = AutoEditForConditionalGeneration.from_pretrained("Syntheresis/auto-edit-hy-500m").to("cuda")

# Auto-edit a random aunt's YouTube comment
input_text = "շատ գրագետ խոսումեք: բայց փաստն այնե որ գողությունը ավելացելա հետեվաբար, ձեր խոսոլը զրոե պեքե աշխատել"
inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
# Շատ գրագետ խոսում եք, բայց փաստն այն է, որ գողությունը ավելացել է, հետեւաբար, ձեր խոսելը զրո է, պետք է աշխատել
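For several texts at once, batched generation works the same way. The following is a minimal sketch reusing the tokenizer and model objects above; it assumes the tokenizer supports the standard transformers padding/truncation arguments, and the max_new_tokens and num_beams settings are illustrative, not taken from this card.

# Batched auto-editing (sketch; generation settings are illustrative)
texts = [input_text, input_text]  # any list of Armenian strings
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to("cuda")
with torch.no_grad():
    generated = model.generate(**batch, max_new_tokens=256, num_beams=4)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))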
Training Details
Training Data
The model was trained on a synthetic dataset derived from the SAW-corpus (https://huggingface.co/datasets/Syntheresis/SAW-corpus). The synthetic dataset was created by:
- Regenerating masked sequences with token-level and character-level generative models.
- Applying back-translation with open-source machine-translation models for text variety.
- Introducing random grammatical errors drawn from over 500 grammatical mistake patterns (a toy sketch of this step follows this list).
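The actual error patterns are not published. As a minimal sketch of what rule-based error injection can look like, the hypothetical ERROR_PATTERNS list below uses mistakes visible in the example comment above (such as the merged auxiliary in "խոսումեք" for "խոսում եք"); the real pipeline uses over 500 such patterns.

import random

# Hypothetical patterns: (correct form, common mistake)
ERROR_PATTERNS = [
    ("խոսում եք", "խոսումեք"),  # merged verb + auxiliary
    ("այն է", "այնե"),          # fused copula
    ("պետք է", "պեքե"),         # colloquial contraction
]

def corrupt(text, p=0.5):
    # Apply each matching pattern with probability p.
    for correct, wrong in ERROR_PATTERNS:
        if correct in text and random.random() < p:
            text = text.replace(correct, wrong)
    return text

clean = "փաստն այն է, որ պետք է աշխատել"
pair = (corrupt(clean), clean)  # (noisy input, clean target) training example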
Training Procedure
Training Hyperparameters
- Training regime: Mixed precision (bf16); batch size increasing from 36,864 to 147,456; ~500,000 steps; warmup followed by cosine annealing at a peak learning rate of 0.0001 (see the schedule sketch below).
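The warmup length is not stated in the card, so the value below is an assumption; the rest of this sketch follows the description: linear warmup to a peak learning rate of 0.0001, then cosine annealing over the remaining ~500,000 steps.

import math

PEAK_LR = 1e-4         # peak learning rate from the card
TOTAL_STEPS = 500_000  # approximate step count from the card
WARMUP_STEPS = 10_000  # assumption; not stated in the card

def lr_at(step):
    # Linear warmup, then cosine annealing toward zero.
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))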
Technical Specifications
Model Architecture and Objective
Architecture: Text-to-text model with cross-attention, pre-norm, Rotary Position Embeddings (RoPE), and Grouped-Query Attention (GQA; a minimal sketch follows below).
Objective: Auto-editing Armenian text to correct errors and improve quality.
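As an illustration of the attention variant, here is a minimal PyTorch sketch of Grouped-Query Attention, in which each group of query heads shares one key/value head; the head counts and shapes are illustrative and not taken from this model.

import torch

def grouped_query_attention(q, k, v):
    # q: (batch, seq, n_q_heads, d); k, v: (batch, seq, n_kv_heads, d)
    group = q.shape[2] // k.shape[2]
    k = k.repeat_interleave(group, dim=2)  # share each KV head across its query group
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> (batch, heads, seq, d)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2)

# Illustrative shapes: 16 query heads sharing 4 key/value heads
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 4, 64)
v = torch.randn(1, 8, 4, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 16, 64)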
Compute Infrastructure
Hardware
Dual NVIDIA RTX 4090 GPUs
Software
- Hugging Face transformers library (version 4.49.0.dev0)
- hy-models library (https://github.com/MMinasyan/hy-models)