# Qwen2.5-1.5B Arabic Summarizer
This model is a fine-tuned version of `unsloth/Qwen2.5-1.5B-Instruct` for Arabic summarization. It was trained using the TRL and `transformers` libraries with Parameter-Efficient Fine-Tuning (PEFT) via LoRA.
## Model Description

This is a 1.5B-parameter small language model (SLM) fine-tuned on a synthetically generated dataset for Arabic summarization. The reference summaries were generated by a larger model (`Qwen/Qwen2.5-14B-Instruct-AWQ`) on Arabic documents derived from the GEM/xlsum dataset.

The model was trained with supervised fine-tuning (SFT) and LoRA adapters, enabling training on consumer GPUs with limited memory (e.g., 16 GB).
## Intended Use

This model is intended for generating concise, accurate Arabic summaries from input texts. It performs best with the exact prompt format seen during training (shown in the How to Use section below).
## Training Data

Training used a synthetic summarization dataset created as follows (an illustrative preprocessing sketch follows the list):

- Source: Arabic subset of the `GEM/xlsum` dataset
- Steps:
  - Noise cleaning and Arabic text normalization
  - Length filtering (300–2500 characters)
  - Duplicate removal (SHA-1 hashing)
  - Language filtering (Arabic-dominant documents only)
  - Topic stratification using TF-IDF + NMF (~5,000 samples)
  - Synthetic summaries generated by `Qwen/Qwen2.5-14B-Instruct-AWQ`
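The pipeline can be approximated with `datasets` and scikit-learn. The snippet below is a minimal sketch of the steps above; the exact normalization rules, thresholds, topic count, and the `GEM/xlsum` config name (`"arabic"`) are illustrative assumptions, not the script used to build the released dataset.

```python
# Hypothetical reconstruction of the preprocessing steps listed above.
import hashlib
import re

import numpy as np
from datasets import load_dataset
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize_arabic(text: str) -> str:
    """Light normalization: drop diacritics, unify alef forms, collapse whitespace."""
    text = re.sub(r"[\u064B-\u0652]", "", text)   # remove tashkeel
    text = re.sub(r"[إأآ]", "ا", text)            # unify alef variants
    return re.sub(r"\s+", " ", text).strip()

def is_arabic_dominant(text: str, threshold: float = 0.7) -> bool:
    """Keep documents whose alphabetic characters are mostly in the Arabic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic = sum("\u0600" <= c <= "\u06FF" for c in letters)
    return arabic / len(letters) >= threshold

ds = load_dataset("GEM/xlsum", "arabic", split="train")        # config name assumed
ds = ds.map(lambda ex: {"text": normalize_arabic(ex["text"])})

seen_hashes = set()
def keep(example) -> bool:
    text = example["text"]
    if not 300 <= len(text) <= 2500:                           # length filter
        return False
    if not is_arabic_dominant(text):                           # language filter
        return False
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()    # SHA-1 de-duplication
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

ds = ds.filter(keep)

# Topic stratification: TF-IDF features, NMF topics, then sample evenly per topic.
tfidf = TfidfVectorizer(max_features=20000)
topics = NMF(n_components=10, random_state=0).fit_transform(
    tfidf.fit_transform(ds["text"])
).argmax(axis=1)
per_topic = 5000 // 10
keep_idx = np.concatenate([np.where(topics == t)[0][:per_topic] for t in range(10)])
stratified = ds.select(keep_idx)
```

The roughly 5,000 stratified documents would then be passed to `Qwen/Qwen2.5-14B-Instruct-AWQ` to generate the reference summaries.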
## Training Procedure

Key details (an illustrative TRL/PEFT configuration sketch follows the list):

- Base model: `unsloth/Qwen2.5-1.5B-Instruct`
- LoRA (PEFT) settings: `r=16`, `alpha=16`, `dropout=0.1`
- Target modules: `q_proj`, `v_proj`, `up_proj`, `down_proj`
- Quantization: 4-bit NF4 (`bnb_4bit_compute_dtype=torch.bfloat16`)
- Optimizer: `paged_adamw_32bit`
- Learning rate: 2e-4, cosine schedule, 3% warmup
- Epochs: 2
- Batch size: 2 per device, 4 gradient-accumulation steps
- Evaluation: based on validation loss
- Gradient clipping: 0.3
- Checkpointing: the checkpoint with the best eval loss is kept
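The hyperparameters above map onto `peft` and TRL's `SFTTrainer` roughly as follows. This is a minimal sketch, not the authors' training script: the output directory, evaluation/save cadence, `bf16` flag, and the placeholder dataset are assumptions, and the real prompt/summary formatting is omitted.

```python
# Hedged sketch of the SFT + LoRA setup using the listed hyperparameters.
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-1.5B-Instruct")

# LoRA adapters on the listed attention/MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Placeholder data: in practice these are the prompt + reference-summary pairs.
train_dataset = Dataset.from_dict({"text": ["<instruction + document>\n\nالملخص: <summary>"]})
eval_dataset = train_dataset

training_args = SFTConfig(
    output_dir="qwen2.5-1.5b-arabic-sum",        # assumed
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_grad_norm=0.3,                           # gradient clipping
    optim="paged_adamw_32bit",
    eval_strategy="epoch",                       # evaluate on validation loss
    save_strategy="epoch",
    load_best_model_at_end=True,                 # keep the best-eval-loss checkpoint
    metric_for_best_model="eval_loss",
    bf16=True,                                   # assumed
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()
```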
## Framework Versions
- TRL: 0.18.0
- Transformers: 4.52.3
- PyTorch: 2.7.0
- Datasets: 3.6.0
- Tokenizers: 0.21.1
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftModel
import torch

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-1.5B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Load the LoRA adapters and merge them into the base weights
model = PeftModel.from_pretrained(base_model, "ml-maverick/Qwen2.5-1.5B-Instruct-ArabicSum")
model = model.merge_and_unload()
model.eval()

# Prompt format used during training: instruction, then the document, then "الملخص:" ("Summary:")
instruction = (
    "أنت كاتب عربي محترف ذو خبرة واسعة في تلخيص النصوص بدقة وإيجاز."
    " عند استلام نص، اتبع الخطوات التالية لضمان تقديم ملخص فعّال:\n"
    "1. قم بتحليل المحتوى بعناية لتحديد الفكرة الرئيسية.\n"
    "2. استخرج المعلومات الجوهرية.\n"
    "3. صغ ملخصًا واضحًا وموجزًا لا يتجاوز ثلاث جمل.\n"
    "4. تجنب التفاصيل غير الموجودة، والتزم بالدقة.\n\n"
)
text = "أظهرت دراسة حديثة أن..."
input_prompt = f"{instruction}{text}\n\nالملخص:"

inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    max_new_tokens=200,
    do_sample=True,          # sampling is required for temperature/top_p to take effect
    temperature=0.4,
    top_p=0.9,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,  # use the tokenizer's mask (pad == eos here)
        generation_config=generation_config,
    )

# Keep only the text after the "الملخص:" marker
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
summary = output.split("الملخص:")[-1].strip()
print("Generated Summary:", summary)
```
## Limitations and Bias

- Synthetic-data bias: the reference summaries were produced by `Qwen/Qwen2.5-14B-Instruct-AWQ`, so the model can inherit that model's biases and stylistic habits.
- Prompt sensitivity: output quality degrades without the exact training prompt format.
- Arabic only: the model is not intended for other languages.
- Factual accuracy is not guaranteed; summaries may omit or misstate details.
## Citation

```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra et al.},
  year         = {2022},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

@article{qwen2024qwen2,
  title   = {{Qwen2}: A Strong Large Language Model Family},
  author  = {Qwen Team},
  journal = {arXiv preprint arXiv:2406.01175},
  year    = {2024}
}

@article{wolf2020transformers,
  title   = {Transformers: State-of-the-Art NLP},
  author  = {Wolf, Thomas et al.},
  journal = {arXiv preprint arXiv:1910.03771},
  year    = {2020}
}

@article{lhoest2021datasets,
  title   = {Datasets: A Community Library},
  author  = {Lhoest, Quentin et al.},
  journal = {arXiv preprint arXiv:2109.02844},
  year    = {2021}
}

@software{peft,
  title  = {{PEFT}: Parameter-Efficient Fine-Tuning},
  author = {Hugging Face},
  year   = {2023},
  url    = {https://github.com/huggingface/peft}
}
```