# Isnad AI: AraBERT for Ayah & Hadith Span Detection in LLM Outputs
This repository contains the official fine-tuned model for the Isnad AI system, the submission to IslamicEval 2025 Shared Task 1A. The model identifies character-level spans of Quranic verses (Ayahs) and Prophetic sayings (Hadiths) within text generated by Large Language Models (LLMs).
By: Fatimah Emad Eldin
Cairo University
## 📖 Model Description

This model fine-tunes AraBERTv2 (`aubmindlab/bert-base-arabertv2`) on a specialized token classification task. Its purpose is to label tokens within a given Arabic text according to the BIO schema:
- `B-Ayah` (Beginning of a Quranic verse)
- `I-Ayah` (Inside a Quranic verse)
- `B-Hadith` (Beginning of a Prophetic saying)
- `I-Hadith` (Inside a Prophetic saying)
- `O` (Outside of any religious citation)
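To make the schema concrete, here is a minimal sketch of how BIO tags decode into labeled spans. The tokens, tags, and decoder below are illustrative, not actual model output:

```python
# Minimal BIO decoder: groups B-/I- tagged tokens into labeled spans.
# Tokens and tags here are illustrative, not real model predictions.

def decode_bio(tokens, tags):
    """Collect (label, text) spans from parallel token/BIO-tag lists."""
    spans, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" or an inconsistent I- tag closes any open span
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

tokens = ["قال", "النبي", ":", "عليكم", "بالصدق", "."]
tags   = ["O",   "O",     "O", "B-Hadith", "I-Hadith", "O"]
print(decode_bio(tokens, tags))  # [('Hadith', 'عليكم بالصدق')]
```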
The key innovation behind this model is a novel rule-based data generation pipeline that programmatically creates a large-scale, high-quality training corpus from authentic religious texts, completely eliminating the need for manual annotation. This method proved highly effective, enabling the model to learn the contextual patterns of how LLMs cite Islamic sources.
## 🚀 How to Use

You can easily use this model with the `transformers` library pipeline for `token-classification` (or `ner`). For best results, use `aggregation_strategy="simple"` to group token pieces into coherent entities.
```python
from transformers import pipeline

# Load the token classification pipeline
model_id = "FatimahEmadEldin/Isnad-AI-Identifying-Islamic-Citation"
islamic_ner = pipeline(
    "token-classification",
    model=model_id,
    aggregation_strategy="simple"
)

# Example text from an LLM response
text = "يوضح لنا الدين أهمية الصدق، وفي الحديث الشريف نجد أن النبي قال: عليكم بالصدق. كما أنزل الله في كتابه الكريم: يا أيها الذين آمنوا اتقوا الله وكونوا مع الصادقين."

# Get the identified spans
results = islamic_ner(text)

# Print the results
for entity in results:
    print(f"Entity: {entity['word']}")
    print(f"Label: {entity['entity_group']}")
    print(f"Score: {entity['score']:.4f}\n")

# Expected output:
# Entity: عليكم بالصدق
# Label: Hadith
# Score: 0.9876
#
# Entity: يا أيها الذين آمنوا اتقوا الله وكونوا مع الصادقين
# Label: Ayah
# Score: 0.9912
```
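Because the shared task scores character-level spans, the `start`/`end` character offsets that the aggregated pipeline output carries are often more useful than the surface text. The sketch below operates on a hand-written `results` list shaped like the pipeline's output; the offsets and scores are illustrative values, not real model output:

```python
# Convert aggregated pipeline entities into (label, start, end) character
# spans, dropping low-confidence predictions. The `results` list below is
# hand-written with illustrative offsets/scores, mimicking pipeline output.

results = [
    {"entity_group": "Hadith", "word": "عليكم بالصدق",
     "score": 0.98, "start": 52, "end": 64},
    {"entity_group": "Ayah", "word": "يا أيها الذين آمنوا اتقوا الله وكونوا مع الصادقين",
     "score": 0.99, "start": 97, "end": 146},
]

def to_char_spans(entities, min_score=0.5):
    """Keep confident entities as (label, start, end) character spans."""
    return [(e["entity_group"], e["start"], e["end"])
            for e in entities if e["score"] >= min_score]

print(to_char_spans(results))
# [('Hadith', 52, 64), ('Ayah', 97, 146)]
```

The `min_score` threshold is a hypothetical knob for trading precision against recall; it is not part of the model itself.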
## ⚙️ Training Procedure

### Data Generation

The model was trained exclusively on a synthetically generated dataset to overcome the lack of manually annotated data for this specific task. The pipeline involved several stages:
- **Data Sourcing:** Authentic texts were sourced from `quran.json` (containing all Quranic verses) and a JSON file of the Six Major Hadith Collections.
- **Text Preprocessing:** Long Ayahs were split into smaller segments to prevent sequence truncation, and the data was augmented by creating versions with and without Arabic diacritics (Tashkeel).
- **Template-Based Generation:** Each religious text was embedded into realistic contextual templates using a curated list of common prefixes (e.g., "قال الله تعالى:") and suffixes (e.g., "صدق الله العظيم"). Noise was also injected by adding neutral sentences to better simulate LLM outputs.
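The generation steps above can be sketched as follows. The prefix/suffix lists and the span bookkeeping are illustrative re-creations, not the exact templates used in training:

```python
import random
import re

# Illustrative re-creation of the preprocessing and template-based generation
# steps; the exact prefixes, suffixes, and JSON sources are not reproduced.

TASHKEEL = re.compile(r"[\u064B-\u0652]")  # Arabic diacritic marks

def strip_tashkeel(text):
    """Return an undiacritized copy of the text (augmentation variant)."""
    return TASHKEEL.sub("", text)

AYAH_PREFIXES = ["قال الله تعالى: ", "كما أنزل الله في كتابه الكريم: "]
AYAH_SUFFIXES = [" صدق الله العظيم", ""]

def make_example(quote, label, rng):
    """Embed a source text in a contextual template; record its char span."""
    prefix = rng.choice(AYAH_PREFIXES)
    suffix = rng.choice(AYAH_SUFFIXES)
    sentence = prefix + quote + suffix
    start, end = len(prefix), len(prefix) + len(quote)
    return {"text": sentence, "label": label, "start": start, "end": end}

ex = make_example("يا أيها الذين آمنوا اتقوا الله", "Ayah", random.Random(0))
# The recorded character span always slices back out the embedded quote:
assert ex["text"][ex["start"]:ex["end"]] == "يا أيها الذين آمنوا اتقوا الله"
```

Recording the character offsets at generation time is what makes it possible to emit BIO token labels (and later character-level evaluation spans) without any manual annotation.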
### Fine-Tuning

The `aubmindlab/bert-base-arabertv2` model was fine-tuned with the following key hyperparameters:

- **Learning Rate:** `2e-5`
- **Epochs:** 10 (with early stopping patience of 3)
- **Effective Batch Size:** 16
- **Optimizer:** AdamW
- **Mixed Precision:** fp16 enabled
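For orientation, the hyperparameters above map onto the `transformers` Trainer API roughly as in this config sketch. It is not the actual training script: `train_ds`/`dev_ds` are placeholders for the tokenized synthetic dataset, and argument names follow the Trainer API (newer versions rename `evaluation_strategy` to `eval_strategy`):

```python
# Config sketch only: dataset loading/tokenization is omitted, and
# `train_ds` / `dev_ds` are placeholders for the tokenized synthetic data.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

LABELS = ["O", "B-Ayah", "I-Ayah", "B-Hadith", "I-Hadith"]

model_name = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(LABELS)
)

args = TrainingArguments(
    output_dir="isnad-arabert",
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=16,  # effective batch size 16
    fp16=True,                       # mixed precision
    evaluation_strategy="epoch",     # `eval_strategy` in newer versions
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: tokenized synthetic training set
    eval_dataset=dev_ds,     # placeholder: tokenized dev set
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

AdamW is the Trainer's default optimizer, so it needs no explicit argument.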
## 📊 Evaluation Results

The model was evaluated using the official character-level Macro F1-Score metric for the IslamicEval 2025 shared task.

### Official Test Set Results

The system achieved a final F1-score of 66.97% on the blind test set, demonstrating the effectiveness of the rule-based data generation approach.
| Methodology | Test F1 Score |
|---|---|
| **Isnad AI (Rule-Based Model)** | **66.97%** |
| Generative Data (Ablation) | 50.50% |
| Database Lookup (Ablation) | 34.80% |
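To illustrate what "character-level Macro F1" means, here is a simplified sketch: every character position gets a class label, per-class F1 is computed over positions, and the three class scores are averaged. The official `scoring.py` may differ in details:

```python
# Simplified character-level macro F1: label every character position,
# compute per-class F1, then average over the three classes.
# Illustrative only; the official scoring.py may differ in details.

def char_labels(length, spans):
    """spans: list of (label, start, end); unlabeled chars are 'Neither'."""
    labels = ["Neither"] * length
    for label, start, end in spans:
        for i in range(start, end):
            labels[i] = label
    return labels

def macro_f1(gold, pred, classes=("Neither", "Ayah", "Hadith")):
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

# A 10-character text where the predicted Ayah span is one character short:
gold = char_labels(10, [("Ayah", 0, 4)])
pred = char_labels(10, [("Ayah", 0, 3)])
print(round(macro_f1(gold, pred), 4))  # 0.5934
```

Note that a class with no gold or predicted characters (here `Hadith`) contributes an F1 of 0 to the macro average in this sketch, which is one of the details an official scorer may handle differently.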
### 🏆 Highlight: Development Set Performance

A detailed evaluation on the manually annotated development set provided by the organizers shows strong and balanced performance.

**Final Macro F1-Score on Dev Set: 65.08%**

#### Per-Class Performance (Character-Level)
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| 🟢 Neither | 0.8423 | 0.9688 | 0.9011 |
| 🔵 Ayah | 0.8326 | 0.5574 | 0.6678 |
| 🟡 Hadith | 0.4750 | 0.3333 | 0.3917 |
| **Overall** | 0.7166 | 0.6198 | 0.6535 |
(These results are from the official `scoring.py` script run on the development set.)
## ⚠️ Limitations and Bias

- **Performance on Hadith:** The model's primary challenge is identifying Hadith texts, which have significantly more linguistic and structural variety than Quranic verses. The F1-score for the `Hadith` class is lower than for `Ayah`, indicating the model may miss or misclassify some prophetic sayings.
- **Template Dependency:** The model's knowledge is based on the rule-based templates used for training. It may be less effective at identifying citations that appear in highly novel or unconventional contexts not represented in the training data.
- **Scope:** This model identifies *intended* citations, as per the shared task rules. It does not verify the authenticity or correctness of the citation itself. An LLM could generate a completely fabricated verse, and this model would still identify it if it is presented like a real one.
## ✍️ Citation

If you use this model or the methodology in your research, please cite the paper:

*Coming soon.*