fatmaserry/AraT5v2-arabic-summarization

AraT5v2-arabic-summarization is a fine-tuned version of UBC-NLP/AraT5v2-base-1024, built to perform abstractive summarization of Modern Standard Arabic (MSA) text. It was trained on SummARai v1.0, a high-quality dataset of Arabic book and article summaries, and is part of a hybrid NLP pipeline for Arabic document summarization.


Dataset

The model was trained on SummARai v1.0, which contains 4,328 Arabic paragraphs aligned with abstractive summaries. Data sources include:

  • Literary and educational Arabic books from Hindawi, Noor Book, and Foula.
  • Human-written summaries from EngzKetab and Rajaoshow.
  • Data enhanced with LLaMA-3 and LLaMA-4 for semantic chunking and coherence filtering.

🔗 Dataset GitHub Repo
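
For quick experimentation, here is a minimal loading sketch. It assumes the dataset ships as a JSONL file of paragraph/summary pairs; the file name and field names are hypothetical, so check the dataset repo for the actual layout:

import json

# Hypothetical file and field names; see the dataset repo for the real layout.
pairs = []
with open("summarai_v1.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        pairs.append((record["paragraph"], record["summary"]))

print(f"Loaded {len(pairs)} paragraph/summary pairs")  # expected: 4,328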


Training Details

The table below summarizes the fine-tuning configuration, including model settings, hyperparameters, and the dataset split.

Hyperparameter      Value
------------------  ---------
Learning Rate       2e-5
Epochs              10
Batch Size          16
Max Seq Length      1024
Train/Validation    80% / 20%
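
A sketch of the same configuration expressed with transformers' Seq2SeqTrainingArguments; the card publishes only the table above, so any argument not listed there is an assumption rather than the authors' actual training script:

from transformers import Seq2SeqTrainingArguments

# Values from the table above; output_dir and eval batch size are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="arat5v2-arabic-summarization",  # hypothetical path
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    predict_with_generate=True,
)

# Max Seq Length (1024) is enforced at tokenization time, and the
# 80% / 20% split at the dataset level, e.g.:
# splits = dataset.train_test_split(test_size=0.2)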

Machine

Training was conducted on the following hardware setup via Theta Labs:

  • GPU: NVIDIA H100 80GB × 1
  • CPU: 10 cores
  • Memory: 80 GB RAM
  • Storage: 512 GB ephemeral disk
  • Region: asia-southeast-1

Evaluation Results

The model was evaluated using BERTScore on the held-out validation split:

Metric     Score
---------  ------
Precision  77.55%
Recall     66.54%
F1         71.73%

For reference, csebuetnlp/mt5_multilingual_xlsum achieved F1 = 61.72% on the same data.
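
The sketch below shows how this kind of evaluation can be reproduced with the evaluate library. The card does not state which BERTScore backbone produced the numbers above; with lang="ar" the library selects a default multilingual model, so treat this configuration as an assumption:

import evaluate

# Placeholder lists: model outputs vs. human-written references
# from the held-out validation split.
predictions = ["ملخص مولد بواسطة النموذج"]
references = ["الملخص المرجعي المكتوب يدوياً"]

bertscore = evaluate.load("bertscore")
results = bertscore.compute(predictions=predictions, references=references, lang="ar")

f1 = sum(results["f1"]) / len(results["f1"])
print(f"BERTScore F1: {f1:.2%}")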


How to Use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("fatmaserry/AraT5v2-arabic-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("fatmaserry/AraT5v2-arabic-summarization")

# Placeholder input ("Put an Arabic paragraph to summarize here");
# inputs longer than the 1024-token limit are truncated
text = "ضع هنا فقرة باللغة العربية للتلخيص"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# Beam-search decoding, capping the summary at 150 tokens
summary_ids = model.generate(**inputs, max_length=150, num_beams=4)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summary)
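
Because the encoder input is capped at 1024 tokens, book-length text must be split before summarization. The SummARai pipeline uses LLM-based semantic chunking for this; the sketch below is a much simpler fixed-window stand-in that reuses the tokenizer and model loaded above:

def summarize_long_text(text, chunk_tokens=900):
    # Fixed-size token windows; a rough stand-in for the pipeline's
    # semantic chunking, leaving headroom below the 1024-token limit.
    token_ids = tokenizer(text, truncation=False)["input_ids"]
    partial_summaries = []
    for start in range(0, len(token_ids), chunk_tokens):
        chunk = tokenizer.decode(token_ids[start:start + chunk_tokens],
                                 skip_special_tokens=True)
        inputs = tokenizer(chunk, return_tensors="pt",
                           truncation=True, max_length=1024)
        ids = model.generate(**inputs, max_length=150, num_beams=4)
        partial_summaries.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return " ".join(partial_summaries)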

Intended Use

  • Summarization of long-form Arabic content (e.g., books)
  • Academic research on Arabic NLP
  • Applications in education, journalism, and digital libraries

Authors

  • Fatma El-Zahraa Serry
  • Rowan Madeeh
  • Abdelrahman Gomaa
  • Momen Mostafa
  • Abdelrahman Akeel

Supervisor: Dr. Mohammad El-Ramly


Citation

@misc{summarai2025,
  title={SummARai: A Transformer-Based System for Hybrid Summarization of Large-Scale Arabic Documents},
  author={
    Fatma El-Zahraa Ashraf Serry and 
    Rowan Madeeh and 
    Abdelrahman Gomaa and 
    Momen Mostafa and 
    Abdelrahman Akeel
  },
  year={2025},
  note={Submitted to ICICIS 2025},
  url={https://github.com/fatmaserry/SummARai}
}

Acknowledgements

  • 💻 GPU compute support provided by Theta Labs
  • 🎓 Project completed as part of a graduation thesis at
    Cairo University – Faculty of Computers and Artificial Intelligence