# Isnad AI: AraBERT for Ayah & Hadith Span Detection in LLM Outputs
This repository contains the official fine-tuned model for the Isnad AI system, the submission to IslamicEval 2025 Shared Task 1A. The model identifies character-level spans of Quranic verses (Ayahs) and Prophetic sayings (Hadiths) within text generated by Large Language Models (LLMs).
By: Fatimah Emad Eldin
Cairo University
## 📖 Model Description

This model fine-tunes AraBERTv2 (`aubmindlab/bert-base-arabertv2`) on a specialized token classification task. Its purpose is to label tokens within a given Arabic text according to the BIO schema:
- `B-Ayah` (Beginning of a Quranic verse)
- `I-Ayah` (Inside a Quranic verse)
- `B-Hadith` (Beginning of a Prophetic saying)
- `I-Hadith` (Inside a Prophetic saying)
- `O` (Outside of any religious citation)
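To make the schema concrete, here is a minimal sketch of how BIO tags decode into labeled spans. The tokens, tags, and decoder below are illustrative, not actual model output:

```python
# Minimal BIO decoder: groups B-/I- tagged tokens into labeled spans.
# Tokens and tags here are illustrative, not real model predictions.

def decode_bio(tokens, tags):
    """Collect (label, text) spans from parallel token/BIO-tag lists."""
    spans, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" or an inconsistent I- tag closes any open span
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

tokens = ["قال", "النبي", ":", "عليكم", "بالصدق", "."]
tags   = ["O",   "O",     "O", "B-Hadith", "I-Hadith", "O"]
print(decode_bio(tokens, tags))  # [('Hadith', 'عليكم بالصدق')]
```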
The key innovation behind this model is a novel rule-based data generation pipeline that programmatically creates a large-scale, high-quality training corpus from authentic religious texts, completely eliminating the need for manual annotation. This method proved highly effective, enabling the model to learn the contextual patterns of how LLMs cite Islamic sources.
## 🚀 How to Use

You can easily use this model with the `transformers` library pipeline for `token-classification` (or `ner`). For best results, use `aggregation_strategy="simple"` to group token pieces into coherent entities.
```python
from transformers import pipeline

# Load the token classification pipeline
model_id = "FatimahEmadEldin/Isnad-AI-Identifying-Islamic-Citation"
islamic_ner = pipeline(
    "token-classification",
    model=model_id,
    aggregation_strategy="simple"
)

# Example text from an LLM response
text = "يوضح لنا الدين أهمية الصدق، وفي الحديث الشريف نجد أن النبي قال: عليكم بالصدق. كما أنزل الله في كتابه الكريم: يا أيها الذين آمنوا اتقوا الله وكونوا مع الصادقين."

# Get the identified spans
results = islamic_ner(text)

# Print the results
for entity in results:
    print(f"Entity: {entity['word']}")
    print(f"Label: {entity['entity_group']}")
    print(f"Score: {entity['score']:.4f}\n")

# Expected output:
# Entity: عليكم بالصدق
# Label: Hadith
# Score: 0.9876
#
# Entity: يا أيها الذين آمنوا اتقوا الله وكونوا مع الصادقين
# Label: Ayah
# Score: 0.9912
```
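Because the shared task scores character-level spans, the `start`/`end` character offsets that the aggregated pipeline output carries are often more useful than the surface text. The sketch below operates on a hand-written `results` list shaped like the pipeline's output; the offsets and scores are illustrative values, not real model output:

```python
# Convert aggregated pipeline entities into (label, start, end) character
# spans, dropping low-confidence predictions. The `results` list below is
# hand-written with illustrative offsets/scores, mimicking pipeline output.

results = [
    {"entity_group": "Hadith", "word": "عليكم بالصدق",
     "score": 0.98, "start": 52, "end": 64},
    {"entity_group": "Ayah", "word": "يا أيها الذين آمنوا اتقوا الله وكونوا مع الصادقين",
     "score": 0.99, "start": 97, "end": 146},
]

def to_char_spans(entities, min_score=0.5):
    """Keep confident entities as (label, start, end) character spans."""
    return [(e["entity_group"], e["start"], e["end"])
            for e in entities if e["score"] >= min_score]

print(to_char_spans(results))
# [('Hadith', 52, 64), ('Ayah', 97, 146)]
```

The `min_score` threshold is a hypothetical knob for trading precision against recall; it is not part of the model itself.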
## ⚙️ Training Procedure

### Data Generation

The model was trained exclusively on a synthetically generated dataset to overcome the lack of manually annotated data for this specific task. The pipeline involved several stages:
- **Data Sourcing:** Authentic texts were sourced from `quran.json` (containing all Quranic verses) and a JSON file of the Six Major Hadith Collections.
- **Text Preprocessing:** Long Ayahs were split into smaller segments to prevent sequence truncation, and the data was augmented by creating versions with and without Arabic diacritics (Tashkeel).
- **Template-Based Generation:** Each religious text was embedded into realistic contextual templates using a curated list of common prefixes (e.g., "قال الله تعالى:") and suffixes (e.g., "صدق الله العظيم"). Noise was also injected by adding neutral sentences to better simulate LLM outputs.
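The generation steps above can be sketched as follows. The prefix/suffix lists and the span bookkeeping are illustrative re-creations, not the exact templates used in training:

```python
import random
import re

# Illustrative re-creation of the preprocessing and template-based generation
# steps; the exact prefixes, suffixes, and JSON sources are not reproduced.

TASHKEEL = re.compile(r"[\u064B-\u0652]")  # Arabic diacritic marks

def strip_tashkeel(text):
    """Return an undiacritized copy of the text (augmentation variant)."""
    return TASHKEEL.sub("", text)

AYAH_PREFIXES = ["قال الله تعالى: ", "كما أنزل الله في كتابه الكريم: "]
AYAH_SUFFIXES = [" صدق الله العظيم", ""]

def make_example(quote, label, rng):
    """Embed a source text in a contextual template; record its char span."""
    prefix = rng.choice(AYAH_PREFIXES)
    suffix = rng.choice(AYAH_SUFFIXES)
    sentence = prefix + quote + suffix
    start, end = len(prefix), len(prefix) + len(quote)
    return {"text": sentence, "label": label, "start": start, "end": end}

ex = make_example("يا أيها الذين آمنوا اتقوا الله", "Ayah", random.Random(0))
# The recorded character span always slices back out the embedded quote:
assert ex["text"][ex["start"]:ex["end"]] == "يا أيها الذين آمنوا اتقوا الله"
```

Recording the character offsets at generation time is what makes it possible to emit BIO token labels (and later character-level evaluation spans) without any manual annotation.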
### Fine-Tuning

The `aubmindlab/bert-base-arabertv2` model was fine-tuned with the following key hyperparameters:

- **Learning Rate:** `2e-5`
- **Epochs:** 10 (with early stopping patience of 3)
- **Effective Batch Size:** 16
- **Optimizer:** AdamW
- **Mixed Precision:** fp16 enabled
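For orientation, the hyperparameters above map onto the `transformers` Trainer API roughly as in this config sketch. It is not the actual training script: `train_ds`/`dev_ds` are placeholders for the tokenized synthetic dataset, and argument names follow the Trainer API (newer versions rename `evaluation_strategy` to `eval_strategy`):

```python
# Config sketch only: dataset loading/tokenization is omitted, and
# `train_ds` / `dev_ds` are placeholders for the tokenized synthetic data.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

LABELS = ["O", "B-Ayah", "I-Ayah", "B-Hadith", "I-Hadith"]

model_name = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(LABELS)
)

args = TrainingArguments(
    output_dir="isnad-arabert",
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=16,  # effective batch size 16
    fp16=True,                       # mixed precision
    evaluation_strategy="epoch",     # `eval_strategy` in newer versions
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: tokenized synthetic training set
    eval_dataset=dev_ds,     # placeholder: tokenized dev set
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

AdamW is the Trainer's default optimizer, so it needs no explicit argument.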
## 📊 Evaluation Results

The model was evaluated using the official character-level Macro F1-Score metric for the IslamicEval 2025 shared task.

### Official Test Set Results

The system achieved a final F1-score of 66.97% on the blind test set, demonstrating the effectiveness of the rule-based data generation approach.
| Methodology | Test F1 Score |
|---|---|
| **Isnad AI (Rule-Based Model)** | **66.97%** |
| Generative Data (Ablation) | 50.50% |
| Database Lookup (Ablation) | 34.80% |
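To illustrate what "character-level Macro F1" means, here is a simplified sketch: every character position gets a class label, per-class F1 is computed over positions, and the three class scores are averaged. The official `scoring.py` may differ in details:

```python
# Simplified character-level macro F1: label every character position,
# compute per-class F1, then average over the three classes.
# Illustrative only; the official scoring.py may differ in details.

def char_labels(length, spans):
    """spans: list of (label, start, end); unlabeled chars are 'Neither'."""
    labels = ["Neither"] * length
    for label, start, end in spans:
        for i in range(start, end):
            labels[i] = label
    return labels

def macro_f1(gold, pred, classes=("Neither", "Ayah", "Hadith")):
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

# A 10-character text where the predicted Ayah span is one character short:
gold = char_labels(10, [("Ayah", 0, 4)])
pred = char_labels(10, [("Ayah", 0, 3)])
print(round(macro_f1(gold, pred), 4))  # 0.5934
```

Note that a class with no gold or predicted characters (here `Hadith`) contributes an F1 of 0 to the macro average in this sketch, which is one of the details an official scorer may handle differently.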
### 🏆 Highlight: Development Set Performance

A detailed evaluation on the manually annotated development set provided by the organizers shows strong and balanced performance.

**Final Macro F1-Score on Dev Set: 65.08%**

#### Per-Class Performance (Character-Level)
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| 🟢 Neither | 0.8423 | 0.9688 | 0.9011 |
| 🔵 Ayah | 0.8326 | 0.5574 | 0.6678 |
| 🟡 Hadith | 0.4750 | 0.3333 | 0.3917 |
| **Overall** | 0.7166 | 0.6198 | 0.6535 |
(These results are from the official `scoring.py` script run on the development set.)
## ⚠️ Limitations and Bias

- **Performance on Hadith:** The model's primary challenge is identifying Hadith texts, which have significantly more linguistic and structural variety than Quranic verses. The F1-score for the `Hadith` class is lower than for `Ayah`, indicating the model may miss or misclassify some prophetic sayings.
- **Template Dependency:** The model's knowledge is based on the rule-based templates used for training. It may be less effective at identifying citations that appear in highly novel or unconventional contexts not represented in the training data.
- **Scope:** This model identifies *intended* citations, as per the shared task rules. It does not verify the authenticity or correctness of the citation itself. An LLM could generate a completely fabricated verse, and this model would still identify it if it is presented like a real one.
## ✍️ Citation

If you use this model or the methodology in your research, please cite the paper:

*Coming soon.*