AraT5-Summarization-XLSum

This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024, trained for Arabic text summarization on the XLSum dataset. It generates concise, fluent summaries of Arabic news articles and other long texts.

Model Sources

Paper: AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation
Dataset: XLSum

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Constants for token lengths
TEXT_MAX_TOKEN_LENGTH = 512
SUMMARY_MAX_TOKEN_LENGTH = 192

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("omarsabri8756/AraT5v2-XLSum-arabic-text-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("omarsabri8756/AraT5v2-XLSum-arabic-text-summarization")

def generate_summary(text, model):
    """Summarize a single Arabic text and return the summary as a string."""
    # Tokenize; inputs longer than TEXT_MAX_TOKEN_LENGTH tokens are truncated
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=TEXT_MAX_TOKEN_LENGTH,
        return_tensors="pt",
    )
    # Move tensors to the same device as the model
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    # Beam search with penalties that discourage repeated phrases
    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=SUMMARY_MAX_TOKEN_LENGTH,
        min_length=10,  # Ensure minimum content
        num_beams=3,
        repetition_penalty=3.0,
        length_penalty=2.0,
        no_repeat_ngram_size=3,
    )
    # Only the first generated sequence is decoded and returned
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
arabic_text = "شهدت مدينة طرابلس، مساء أمس الأربعاء، احتجاجات شعبية وأعمال شغب لليوم الثالث على التوالي، وذلك بسبب تردي الوضع المعيشي والاقتصادي. واندلعت مواجهات عنيفة وعمليات كر وفر ما بين الجيش اللبناني والمحتجين استمرت لساعات، إثر محاولة فتح الطرقات المقطوعة، ما أدى إلى إصابة العشرات من الطرفين."
summary = generate_summary(arabic_text, model)
print(summary) # شهدت مدينة طرابلس اللبنانية، مساء أمس الأربعاء، احتجاجات شعبية وأعمال شغب لليوم الثالث على التوالي، وذلك بسبب تردي الوضع المعيشي.

Training Hyperparameters

Training regime: Mixed precision fp16
Optimizer: AdamW
Learning rate: 5e-5 with a cosine scheduler
Per-device train batch size: 4
Per-device eval batch size: 4
Training epochs: 5
Weight decay: 0.01
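
To illustrate what a cosine scheduler does to the 5e-5 learning rate listed above, here is a minimal sketch in plain Python. It assumes no warmup and a decay all the way to zero over a single half-cosine; the model card states neither, and the function name `cosine_lr` and the step count are hypothetical values chosen for illustration only.

```python
import math

BASE_LR = 5e-5  # learning rate reported in the hyperparameters above


def cosine_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Learning rate at `step` under a half-cosine decay from base_lr to 0.

    Assumption: no warmup phase and a final learning rate of zero.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length; the real number of optimizer steps is not reported
TOTAL_STEPS = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate starts at the full 5e-5, falls to half of that at the midpoint, and reaches zero at the last step, so most of the large updates happen in the early epochs.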

Evaluation

Epoch	Training Loss	Validation Loss	ROUGE-1	ROUGE-2	ROUGE-L	ROUGE-Lsum
1	3.999300	2.415354	0.219800	0.106600	0.222500	0.219970
2	2.879100	2.392637	0.241690	0.103100	0.242780	0.241080
3	2.646900	2.324992	0.235450	0.106600	0.237000	0.235820
4	2.472900	2.312325	0.261720	0.120900	0.263300	0.261340
5	2.388600	2.314750	0.267520	0.120900	0.269470	0.266570
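
For readers unfamiliar with the ROUGE columns, the sketch below shows the idea behind ROUGE-1 as a unigram-overlap F1 score between a reference summary and a generated one. It is a simplified illustration, not the evaluation script used for the table: it assumes plain whitespace tokenization with no stemming or normalization, so it will not reproduce the reported numbers. The function name `rouge1_f1` and the two example strings are made up for this sketch.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: tokens are obtained by splitting on whitespace only.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example strings, not taken from XLSum
reference_summary = "احتجاجات شعبية في طرابلس"
generated_summary = "احتجاجات في طرابلس"
print(rouge1_f1(reference_summary, generated_summary))
```

ROUGE-2 applies the same overlap to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.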