fatmaserry/AraT5v2-arabic-summarization

AraT5v2-arabic-summarization is a fine-tuned version of UBC-NLP/AraT5v2-base-1024, built to perform abstractive summarization of Modern Standard Arabic (MSA) text. It was trained on SummARai v1.0, a high-quality dataset of Arabic book and article summaries, and is part of a hybrid NLP pipeline for Arabic document summarization.


Dataset

The model was trained on SummARai v1.0, which contains 4,328 Arabic paragraphs aligned with abstractive summaries. Data sources include:

  • Literary and educational Arabic books from Hindawi, Noor Book, and Foula.
  • Human-written summaries from EngzKetab and Rajaoshow.
  • Data enhanced with LLaMA-3 and LLaMA-4 for semantic chunking and coherence filtering.

🔗 Dataset GitHub Repo
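
For quick experimentation, here is a minimal loading sketch. It assumes the dataset ships as a JSONL file of paragraph/summary pairs; the file name and field names are hypothetical, so check the dataset repo for the actual layout:

import json

# Hypothetical file and field names; see the dataset repo for the real layout.
pairs = []
with open("summarai_v1.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        pairs.append((record["paragraph"], record["summary"]))

print(f"Loaded {len(pairs)} paragraph/summary pairs")  # expected: 4,328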


Training Details

The table below summarizes the fine-tuning configuration, including model settings, hyperparameters, and the dataset split.

Hyperparameter      Value
------------------  ---------
Learning Rate       2e-5
Epochs              10
Batch Size          16
Max Seq Length      1024
Train/Validation    80% / 20%
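
A sketch of the same configuration expressed with transformers' Seq2SeqTrainingArguments; the card publishes only the table above, so any argument not listed there is an assumption rather than the authors' actual training script:

from transformers import Seq2SeqTrainingArguments

# Values from the table above; output_dir and eval batch size are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="arat5v2-arabic-summarization",  # hypothetical path
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    predict_with_generate=True,
)

# Max Seq Length (1024) is enforced at tokenization time, and the
# 80% / 20% split at the dataset level, e.g.:
# splits = dataset.train_test_split(test_size=0.2)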

Machine

Training was conducted on the following hardware setup via Theta Labs:

  • GPU: NVIDIA H100 80GB × 1
  • CPU: 10 cores
  • Memory: 80 GB RAM
  • Storage: 512 GB ephemeral disk
  • Region: asia-southeast-1

Evaluation Results

The model was evaluated using BERTScore on the held-out validation split:

Metric     Score
---------  ------
Precision  77.55%
Recall     66.54%
F1         71.73%

For reference, csebuetnlp/mt5_multilingual_xlsum achieved F1 = 61.72% on the same data.
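
The sketch below shows how this kind of evaluation can be reproduced with the evaluate library. The card does not state which BERTScore backbone produced the numbers above; with lang="ar" the library selects a default multilingual model, so treat this configuration as an assumption:

import evaluate

# Placeholder lists: model outputs vs. human-written references
# from the held-out validation split.
predictions = ["ملخص مولد بواسطة النموذج"]
references = ["الملخص المرجعي المكتوب يدوياً"]

bertscore = evaluate.load("bertscore")
results = bertscore.compute(predictions=predictions, references=references, lang="ar")

f1 = sum(results["f1"]) / len(results["f1"])
print(f"BERTScore F1: {f1:.2%}")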


How to Use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("fatmaserry/AraT5v2-arabic-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("fatmaserry/AraT5v2-arabic-summarization")

# Placeholder input ("Put an Arabic paragraph to summarize here");
# inputs longer than the 1024-token limit are truncated
text = "ضع هنا فقرة باللغة العربية للتلخيص"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# Beam-search decoding, capping the summary at 150 tokens
summary_ids = model.generate(**inputs, max_length=150, num_beams=4)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summary)
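
Because the encoder input is capped at 1024 tokens, book-length text must be split before summarization. The SummARai pipeline uses LLM-based semantic chunking for this; the sketch below is a much simpler fixed-window stand-in that reuses the tokenizer and model loaded above:

def summarize_long_text(text, chunk_tokens=900):
    # Fixed-size token windows; a rough stand-in for the pipeline's
    # semantic chunking, leaving headroom below the 1024-token limit.
    token_ids = tokenizer(text, truncation=False)["input_ids"]
    partial_summaries = []
    for start in range(0, len(token_ids), chunk_tokens):
        chunk = tokenizer.decode(token_ids[start:start + chunk_tokens],
                                 skip_special_tokens=True)
        inputs = tokenizer(chunk, return_tensors="pt",
                           truncation=True, max_length=1024)
        ids = model.generate(**inputs, max_length=150, num_beams=4)
        partial_summaries.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return " ".join(partial_summaries)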

Intended Use

  • Summarization of long-form Arabic content (e.g., books)
  • Academic research on Arabic NLP
  • Applications in education, journalism, and digital libraries

Authors

  • Fatma El-Zahraa Serry
  • Rowan Madeeh
  • Abdelrahman Gomaa
  • Momen Mostafa
  • Abdelrahman Akeel

Supervisor: Dr. Mohammad El-Ramly


Citation

@misc{summarai2025,
  title={SummARai: A Transformer-Based System for Hybrid Summarization of Large-Scale Arabic Documents},
  author={
    Fatma El-Zahraa Ashraf Serry and 
    Rowan Madeeh and 
    Abdelrahman Gomaa and 
    Momen Mostafa and 
    Abdelrahman Akeel
  },
  year={2025},
  note={Submitted to ICICIS 2025},
  url={https://github.com/fatmaserry/SummARai}
}

Acknowledgements

  • 💻 GPU compute support provided by Theta Labs
  • 🎓 Project completed as part of a graduation thesis at
    Cairo University – Faculty of Computers and Artificial Intelligence