Bayaan - Advanced Quran Tafseer Search with AI Vector Models

📖 Overview

Bayaan is an AI-powered Quran Tafseer search system that uses several machine learning models to find relevant Islamic interpretations among roughly 219,000 records drawn from 84 scholarly books. It routes each query to the model best suited to its length: TF-IDF for short keyword queries, Word2Vec for medium-length queries, and BERT for long, contextual ones. The result is like having an intelligent Islamic library at your fingertips.

🛠️ Tech Stack

Flask - REST API framework
scikit-learn - TF-IDF, cosine similarity
SentenceTransformers - Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2 for semantic search
BERT/Word2Vec - Semantic embeddings
pandas/numpy - Data processing
Dataset: 219K Tafseer records from Altafsir.com

🗃️ The Dataset: A Treasure Trove of Islamic Knowledge

Bayaan is powered by the comprehensive Quran-Tafseer dataset from Hugging Face, created by MohamedRashad. This dataset is a goldmine for anyone interested in Islamic studies, natural language processing, or understanding the Quran's deeper meanings.

Dataset Highlights:

📚 84 Different Tafseer Books - From classical to contemporary scholars
📊 219,000 Rows of rich interpretative content
🌍 Source: All data collected from Altafsir.com
🔤 Language: Arabic (with English query support through AI)

What's Inside:

Column	Description	Example
`surah_name`	Name of the Quran chapter	"Al-Fatiha", "Al-Baqarah"
`revelation_type`	Where the Surah was revealed	"Meccan" or "Medinan"
`ayah`	The specific Quranic verse	"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
`tafsir_book`	Source of the interpretation	"Ibn Kathir", "Al-Jalalayn"
`tafsir_content`	The actual scholarly commentary	Detailed Arabic interpretation
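
As an illustration of the columns above, here is a minimal sketch of a typed record and a CSV reader. The `TafseerRecord` class and `read_records` helper are hypothetical names used only for this example, and the in-memory sample row stands in for the real `tafseer.csv`; the project itself lists pandas for data processing, so its actual loading code may look different.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TafseerRecord:
    """One row of the dataset, using the five columns listed above."""
    surah_name: str
    revelation_type: str
    ayah: str
    tafsir_book: str
    tafsir_content: str


def read_records(csv_file):
    """Yield one TafseerRecord per CSV row, ignoring any extra columns."""
    names = [f.name for f in fields(TafseerRecord)]
    for row in csv.DictReader(csv_file):
        yield TafseerRecord(**{name: row[name] for name in names})


# Tiny in-memory sample standing in for tafseer.csv.
# The commentary value is a placeholder, not real dataset content.
sample = io.StringIO(
    "surah_name,revelation_type,ayah,tafsir_book,tafsir_content\n"
    "Al-Fatiha,Meccan,بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ,Ibn Kathir,(commentary text)\n"
)
records = list(read_records(sample))
print(records[0].surah_name, "/", records[0].tafsir_book)
```

A frozen dataclass keeps each record immutable, which is convenient when the same rows are shared by several search models.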

🤖 How Bayaan Makes It Smart

Bayaan doesn't just do keyword matching. It combines several AI approaches to capture context, meaning, and relationships between concepts.

🌟 Key Features

🤖 Multi-Model AI Search

TF-IDF Vectorization: Used for short queries (≤2 words)
Word2Vec Embeddings: Used for medium-length queries (3-10 words)
BERT Transformers: Deeper semantic matching for long queries (>10 words)
SentenceTransformers: Arabic sentence-embedding model (Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2) for semantic search

🎯 Intelligent Query Routing

Hybrid Search Algorithm: Automatically selects the best AI model based on query characteristics
Fallback Mechanisms: Ensures reliable results even when specific models are unavailable
Contextual Understanding: Semantic similarity matching beyond keyword matching
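
The routing and fallback behaviour described above could be sketched roughly as follows. The word-count thresholds come from the Multi-Model AI Search list; the function name `choose_model`, the model labels, and the fallback order are assumptions made for illustration, not the project's actual implementation.

```python
def choose_model(query, available):
    """Pick a model label from the query's word count, with fallback.

    `available` is the set of model labels that are loaded and usable.
    """
    n_words = len(query.split())
    if n_words <= 2:
        preferred = "tfidf"
    elif n_words <= 10:
        preferred = "word2vec"
    else:
        preferred = "bert"

    # Assumed fallback order: the preferred model first, then the
    # remaining models from most to least semantic.
    for name in (preferred, "bert", "word2vec", "tfidf"):
        if name in available:
            return name
    raise RuntimeError("no search model is available")


all_models = {"tfidf", "word2vec", "bert"}
print(choose_model("الصبر", all_models))
print(choose_model("patience during hardship", all_models))
print(choose_model("patience during hardship", {"tfidf"}))
```

Splitting on whitespace is a simple way to count words for both Arabic and English queries; a real router might also take language or complexity into account, as the architecture diagram suggests.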

🔍 Advanced Search Capabilities

Semantic Search: Find conceptually similar content, not just keyword matches
Multi-field Search: Search across Ayahs, Tafseer content, Surah names, and more
Similarity Scoring: Results ranked by similarity score
Flexible Result Limits: Configurable result counts (1-50 results)
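
To make the scoring and result-limit behaviour concrete, here is a small sketch of cosine-similarity ranking with NumPy. The function name `top_k_by_cosine` and the toy vectors are hypothetical; only the use of cosine similarity and the 1-50 result range come from this README.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_matrix, k=10):
    """Return up to k (row_index, score) pairs, best match first."""
    k = max(1, min(int(k), 50))  # clamp to the documented 1-50 range

    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # All-zero vectors get a score of 0 instead of a division by zero.
    scores = np.divide(
        doc_matrix @ query_vec,
        denom,
        out=np.zeros(doc_matrix.shape[0]),
        where=denom > 0,
    )
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional "embeddings" for three documents.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
results = top_k_by_cosine(query, docs, k=2)
print(results)
```

The same function would apply to any of the pre-computed embedding matrices, as long as the query is embedded with the same model as the documents.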

🏗️ System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Query Input   │───▶│  Hybrid Router   │───▶│   AI Models     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              │                         │
                              ▼                         ▼
                    ┌──────────────────┐    ┌─────────────────────┐
                    │  Query Analysis  │    │  • TF-IDF Matrix    │
                    │  - Length Check  │    │  • Word2Vec Vectors │
                    │  - Complexity    │    │  • BERT Embeddings  │
                    │  - Language      │    │  • SentenceTransform│
                    └──────────────────┘    └─────────────────────┘
                              │                         │
                              ▼                         ▼
                    ┌──────────────────┐    ┌─────────────────────┐
                    │  Result Ranking  │◀───│  Similarity Engine  │
                    │  - Cosine Sim    │    │  - Vector Matching  │
                    │  - Score Fusion  │    │  - Context Analysis │
                    └──────────────────┘    └─────────────────────┘
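
The diagram names a "Score Fusion" step without describing it. One common way to combine scores from several models is a weighted sum of min-max normalized scores, sketched below. This is an assumption about what such a step could look like, not a description of Bayaan's code; `min_max`, `fuse_scores`, and the weights are all hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(per_model, weights):
    """Combine per-model scores into one ranking, best first."""
    fused = {}
    for model, scores in per_model.items():
        if not scores:
            continue  # this model returned nothing for the query
        weight = weights.get(model, 0.0)
        for doc_id, s in min_max(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * s
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Illustrative scores only; document ids and values are made up.
per_model = {
    "tfidf": {1: 0.2, 2: 0.8},
    "bert": {2: 0.5, 3: 0.9},
}
ranked = fuse_scores(per_model, weights={"tfidf": 0.4, "bert": 0.6})
print(ranked)
```

Normalizing first keeps a model with a naturally larger score range from dominating the fused ranking.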

📊 Required Data Files

File	Description	Required	Size
`tafseer.csv`	Main Tafseer dataset	✅ Yes	Variable
`w2v_vectors.npy`	Pre-computed Word2Vec embeddings	⚠️ Optional	~100MB
`bert_vectors.npy`	Pre-computed BERT embeddings	⚠️ Optional	~200MB
`tafsir_embeddings.npy`	SentenceTransformer embeddings	⚠️ Optional	~300MB
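
Since only `tafseer.csv` is required, a startup check along these lines could decide which models are usable. This is a sketch under assumptions: the `available_models` helper and the model labels are hypothetical, and it assumes the TF-IDF index is built from `tafseer.csv` so that it is always available.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ["tafseer.csv"]
# Optional embedding files from the table above, mapped to assumed labels.
OPTIONAL_FILES = {
    "w2v_vectors.npy": "word2vec",
    "bert_vectors.npy": "bert",
    "tafsir_embeddings.npy": "sentence_transformer",
}


def available_models(data_dir):
    """Return the set of model labels whose data files are present."""
    data_dir = Path(data_dir)
    missing = [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing required file(s): {missing}")

    models = {"tfidf"}  # assumed to be built from tafseer.csv at startup
    for filename, model in OPTIONAL_FILES.items():
        if (data_dir / filename).is_file():
            models.add(model)
    return models


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "tafseer.csv").touch()
    Path(tmp, "bert_vectors.npy").touch()
    found = available_models(tmp)
print(sorted(found))
```

The returned set can be passed straight to a query router so that it falls back to whatever models are actually loaded.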
