fatmaserry/AraT5v2-arabic-summarization
AraT5v2-arabic-summarization is a fine-tuned version of UBC-NLP/AraT5v2-base-1024, built to perform abstractive summarization of Modern Standard Arabic (MSA) text. It was trained on SummARai v1.0, a high-quality dataset of Arabic book and article summaries, and is part of a hybrid NLP pipeline for Arabic document summarization.
Table of Contents
- Dataset
- Training Details
- Evaluation Results
- How to Use
- Intended Use
- Authors
- Citation
- Acknowledgements
Dataset
The model was trained on SummARai v1.0, which includes 4,328 aligned Arabic paragraphs and abstractive summaries. Data sources include:
- Literary and educational Arabic books from Hindawi, Noor Book, and Foula.
- Human-written summaries from EngzKetab and Rajaoshow.
- The collected data was enhanced via LLaMA-3 and LLaMA-4 for semantic chunking and coherence filtering.
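The card does not specify the dataset's distribution format. As a minimal sketch, assuming the aligned pairs ship as a JSON Lines file with hypothetical `text` and `summary` fields (file name and field names are assumptions, not the official SummARai v1.0 schema):

```python
import json

def load_pairs(path):
    """Yield (text, summary) tuples from a JSON Lines file.
    The file name and field names are assumptions for illustration."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["text"], record["summary"]

pairs = list(load_pairs("summarai_v1.jsonl"))
print(len(pairs))  # expected: 4328 aligned pairs
```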
Training Details
The table below summarizes the fine-tuning configuration, including model settings, hyperparameters, and the dataset split; a training sketch follows the table.

| Hyperparameter | Value |
|---|---|
| Learning Rate | 2e-5 |
| Epochs | 10 |
| Batch Size | 16 |
| Max Seq Length | 1024 |
| Train/Validation | 80% / 20% |
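These values map directly onto the Hugging Face `Seq2SeqTrainer` API. The sketch below is a hedged reconstruction, not the authors' training script: the placeholder records, preprocessing, and the 150-token label cap are assumptions; only the hyperparameter values come from the table.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5v2-base-1024")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5v2-base-1024")

# Placeholder records; in practice these are the 4,328 SummARai v1.0 pairs.
raw = Dataset.from_dict({
    "text": ["نص عربي طويل للتدريب", "فقرة عربية ثانية"],
    "summary": ["ملخص قصير", "ملخص آخر"],
})

def preprocess(batch):
    # Truncate inputs to the 1024-token context from the table above;
    # the 150-token label cap mirrors the generation setting (an assumption).
    model_inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)
split = tokenized.train_test_split(test_size=0.2, seed=42)  # 80% / 20% as in the table

args = Seq2SeqTrainingArguments(
    output_dir="arat5v2-arabic-summarization",
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```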
Machine
Training was conducted on the following hardware setup via Theta Labs:
- GPU: NVIDIA H100 80GB × 1
- CPU: 10 cores
- Memory: 80 GB RAM
- Storage: 512 GB ephemeral disk
- Region: asia-southeast-1
Evaluation Results
The model was evaluated using BERTScore on the held-out validation split:
| Metric | Score |
|---|---|
| Precision | 77.55% |
| Recall | 66.54% |
| F1 | 71.73% |
For reference, csebuetnlp/mt5_multilingual_xlsum achieved F1 = 61.72% on the same data.
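An evaluation of this kind can be run with the `bert-score` package against the reference summaries. A minimal sketch, assuming the library's default multilingual scorer for Arabic (the card does not state which underlying scorer model the authors used):

```python
from bert_score import score

# Placeholders: in practice these are the generated and gold summaries
# for the full 20% validation split.
candidates = ["ملخص متولد من النموذج"]        # model outputs
references = ["الملخص المرجعي المكتوب يدويا"]  # human-written summaries

P, R, F1 = score(candidates, references, lang="ar")
print(f"Precision={P.mean().item():.4f}  "
      f"Recall={R.mean().item():.4f}  F1={F1.mean().item():.4f}")
```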
How to Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("fatmaserry/AraT5v2-arabic-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("fatmaserry/AraT5v2-arabic-summarization")

# Placeholder input ("Put an Arabic paragraph to summarize here")
text = "ضع هنا فقرة باللغة العربية للتلخيص"

# Truncate to the 1024-token context the model was trained with
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, max_length=150, num_beams=4)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
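Because the model was fine-tuned with a 1024-token context, book-length input has to be split before summarization. The full SummARai pipeline uses LLM-based semantic chunking; the sketch below substitutes a naive paragraph-packing chunker for illustration and reuses the `tokenizer` and `model` loaded above.

```python
def summarize_long(text, max_tokens=1024):
    """Naively pack paragraphs into chunks that fit the model's context,
    then summarize each chunk independently."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = (current + "\n\n" + p).strip()
        if current and len(tokenizer(candidate)["input_ids"]) > max_tokens:
            chunks.append(current)  # current chunk is full; start a new one
            current = p
        else:
            current = candidate
    if current:
        chunks.append(current)

    partial_summaries = []
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors="pt",
                           truncation=True, max_length=max_tokens)
        ids = model.generate(**inputs, max_length=150, num_beams=4)
        partial_summaries.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return "\n".join(partial_summaries)
```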
Intended Use
- Summarization of Arabic long-form content (books)
- Academic research on Arabic NLP
- Applications in education, journalism, and digital libraries
Authors
- Fatma El-Zahraa Serry
- Rowan Madeeh
- Abdelrahman Gomaa
- Momen Mostafa
- Abdelrahman Akeel
Supervisor: Dr. Mohammad El-Ramly
Citation
```bibtex
@misc{summarai2025,
  title={SummARai: A Transformer-Based System for Hybrid Summarization of Large-Scale Arabic Documents},
  author={Fatma El-Zahraa Ashraf Serry and Rowan Madeeh and Abdelrahman Gomaa and Momen Mostafa and Abdelrahman Akeel},
  year={2025},
  note={Submitted to ICICIS 2025},
  url={https://github.com/fatmaserry/SummARai}
}
```
Acknowledgements
- 💻 GPU compute support provided by Theta Labs
- 🎓 Project completed as part of a graduation thesis at Cairo University – Faculty of Computers and Artificial Intelligence