Bayaan - Advanced Quran Tafseer Search with AI Vector Models
Overview
Bayaan is an AI-powered Quran Tafseer search system that uses multiple machine learning models to find relevant Islamic interpretations among 219,000 records from 84 scholarly books. It selects a model based on the query: short keyword queries use TF-IDF, while longer contextual queries use Word2Vec or BERT. The result works like an intelligent Islamic library at your fingertips.
Tech Stack
- Flask - REST API framework
- scikit-learn - TF-IDF, cosine similarity
- SentenceTransformers - Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2 for semantic search
- BERT/Word2Vec - Semantic embeddings
- pandas/numpy - Data processing
- Dataset: 219K Tafseer records from Altafsir.com
The Dataset: A Treasure Trove of Islamic Knowledge
Bayaan is powered by the comprehensive Quran-Tafseer dataset from Hugging Face, created by MohamedRashad. This dataset is a goldmine for anyone interested in Islamic studies, natural language processing, or understanding the Quran's deeper meanings.
Dataset Highlights:
- 84 Different Tafseer Books - From classical to contemporary scholars
- 219,000 Rows of rich interpretative content
- Source: All data collected from Altafsir.com
- Language: Arabic (with English query support through AI)
What's Inside:
Column | Description | Example |
---|---|---|
surah_name | Name of the Quran chapter | "Al-Fatiha", "Al-Baqarah" |
revelation_type | Where the Surah was revealed | "Meccan" or "Medinan" |
ayah | The specific Quranic verse | "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ" |
tafsir_book | Source of the interpretation | "Ibn Kathir", "Al-Jalalayn" |
tafsir_content | The actual scholarly commentary | Detailed Arabic interpretation |
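As a quick sanity check on the schema above, the following minimal sketch verifies that a CSV header contains the five listed columns. It assumes the data is stored as a plain comma-separated file with a header row; the helper name `missing_columns` and the sample row are hypothetical and only serve the illustration.

```python
import csv
import io

# Column names as listed in the table above.
EXPECTED_COLUMNS = [
    "surah_name",
    "revelation_type",
    "ayah",
    "tafsir_book",
    "tafsir_content",
]


def missing_columns(csv_text):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in EXPECTED_COLUMNS if name not in header]


# Tiny made-up sample standing in for the real file (placeholder values).
sample = (
    "surah_name,revelation_type,ayah,tafsir_book,tafsir_content\n"
    "Al-Fatiha,Meccan,(verse text),Ibn Kathir,(commentary text)\n"
)
print(missing_columns(sample))  # []
```

Running the same check against the real file before building any index would catch a renamed or missing column early.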
How Bayaan Makes It Smart
Bayaan goes beyond keyword matching: it combines several AI approaches to match queries on context, meaning, and relationships between concepts, not only on exact terms.
Key Features
Multi-Model AI Search
- TF-IDF Vectorization: Optimized for short queries (≤2 words)
- Word2Vec Embeddings: Perfect for medium-length queries (≤10 words)
- BERT Transformers: Advanced semantic understanding for long queries (>10 words)
- SentenceTransformers: State-of-the-art Arabic language model for advanced search
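The word-count thresholds in this list can be sketched as a small routing function. This is an illustration under stated assumptions, not Bayaan's actual code: the function name `choose_model`, the returned labels, and the use of whitespace splitting to count words are all hypothetical.

```python
def choose_model(query: str) -> str:
    """Pick a model label from the number of whitespace-separated words."""
    n_words = len(query.split())
    if n_words <= 2:
        return "tfidf"      # short keyword queries
    if n_words <= 10:
        return "word2vec"   # medium-length queries
    return "bert"           # long queries (more than 10 words)


print(choose_model("patience"))                         # tfidf
print(choose_model("meaning of patience in hardship"))  # word2vec
```

Whitespace splitting is a simplification; Arabic queries with attached particles may call for a proper tokenizer.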
Intelligent Query Routing
- Hybrid Search Algorithm: Automatically selects the best AI model based on query characteristics
- Fallback Mechanisms: Ensures reliable results even when specific models are unavailable
- Contextual Understanding: Semantic similarity matching beyond keyword matching
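One way to picture the fallback behaviour is a fixed preference order that is walked until an available model is found. The sketch below is hypothetical: the order in `FALLBACK_ORDER`, the model labels, and the function `resolve_model` are assumptions made for illustration, since the README does not specify how fallbacks are chosen.

```python
# Assumed preference order, from most to least semantically rich.
FALLBACK_ORDER = ["bert", "word2vec", "tfidf"]


def resolve_model(preferred, available):
    """Return the preferred model if available, else the first available fallback."""
    if preferred in available:
        return preferred
    for name in FALLBACK_ORDER:
        if name in available:
            return name
    raise RuntimeError("no search model is available")


# BERT vectors are missing here, so the router drops to Word2Vec.
print(resolve_model("bert", {"word2vec", "tfidf"}))  # word2vec
```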
Advanced Search Capabilities
- Semantic Search: Find conceptually similar content, not just keyword matches
- Multi-field Search: Search across Ayahs, Tafseer content, Surah names, and more
- Similarity Scoring: Ranked results with confidence scores
- Flexible Result Limits: Configurable result counts (1-50 results)
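Similarity scoring and result limits can be illustrated with a dependency-free ranking sketch. It stands in for the scikit-learn cosine similarity named in the tech stack; the function names, the toy two-dimensional vectors, and the choice to clamp out-of-range limits into 1-50 (rather than reject them) are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, limit=10):
    """Return (doc_index, score) pairs, best first, with limit clamped to 1-50."""
    limit = max(1, min(50, limit))
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, round(score, 3)) for score, idx in scored[:limit]]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], docs, limit=2))  # [(0, 1.0), (2, 0.707)]
```

With 219,000 records, a vectorized implementation over a matrix of embeddings would be the practical choice; the loop here only shows the scoring logic.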
System Architecture
┌─────────────────┐    ┌──────────────────┐    ┌────────────────────────┐
│   Query Input   │───▶│  Hybrid Router   │───▶│       AI Models        │
└─────────────────┘    └──────────────────┘    └────────────────────────┘
                                │                          │
                                ▼                          ▼
                       ┌──────────────────┐    ┌────────────────────────┐
                       │  Query Analysis  │    │ • TF-IDF Matrix        │
                       │  - Length Check  │    │ • Word2Vec Vectors     │
                       │  - Complexity    │    │ • BERT Embeddings      │
                       │  - Language      │    │ • SentenceTransformers │
                       └──────────────────┘    └────────────────────────┘
                                │                          │
                                ▼                          ▼
                       ┌──────────────────┐    ┌────────────────────────┐
                       │  Result Ranking  │◀───│  Similarity Engine     │
                       │  - Cosine Sim    │    │  - Vector Matching     │
                       │  - Score Fusion  │    │  - Context Analysis    │
                       └──────────────────┘    └────────────────────────┘
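The diagram names a "Score Fusion" step without describing it. One common approach, shown here purely as an assumption, is to min-max normalize each model's scores and take a weighted average. The functions `min_max` and `fuse` and the default weight are hypothetical and may differ from what Bayaan actually does.

```python
def min_max(scores):
    """Rescale scores to the range 0-1 (all zeros if every score is equal)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two normalized score lists for the same documents."""
    a, b = min_max(scores_a), min_max(scores_b)
    return [weight_a * x + (1 - weight_a) * y for x, y in zip(a, b)]


# Scores for the same three documents from two hypothetical models.
print(fuse([0.0, 0.5, 1.0], [0.25, 0.75, 0.5]))  # [0.0, 0.75, 0.75]
```

Normalizing first keeps a model with a wider raw score range from dominating the combined ranking.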
Required Data Files
File | Description | Required | Size |
---|---|---|---|
tafseer.csv | Main Tafseer dataset | Yes | Variable |
w2v_vectors.npy | Pre-computed Word2Vec embeddings | Optional | ~100MB |
bert_vectors.npy | Pre-computed BERT embeddings | Optional | ~200MB |
tafsir_embeddings.npy | SentenceTransformer embeddings | Optional | ~300MB |
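Because three of the four files are optional, a startup check can decide which models to enable. The sketch below uses the file names from the table; the function `available_models`, the model labels, and the assumption that TF-IDF can always be built from `tafseer.csv` alone are hypothetical.

```python
from pathlib import Path

# Optional embedding files from the table above, keyed by an assumed model label.
OPTIONAL_FILES = {
    "word2vec": "w2v_vectors.npy",
    "bert": "bert_vectors.npy",
    "sentence_transformer": "tafsir_embeddings.npy",
}


def available_models(data_dir):
    """List the models that can run given the files present in data_dir."""
    data_dir = Path(data_dir)
    if not (data_dir / "tafseer.csv").is_file():
        raise FileNotFoundError("tafseer.csv is required")
    # Assumption: TF-IDF needs only the main dataset, so it is always enabled.
    models = ["tfidf"]
    for name, filename in OPTIONAL_FILES.items():
        if (data_dir / filename).is_file():
            models.append(name)
    return models
```

For example, a directory holding only `tafseer.csv` and `bert_vectors.npy` would yield `["tfidf", "bert"]`, and the remaining models would be skipped.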
Model tree for musabalosimi/bayaan
Base model: aubmindlab/bert-base-arabertv02