# CrawlGPT Documentation

## Overview

CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.

## Project Structure

```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/
│       │   ├── DatabaseHandler.py
│       │   ├── LLMBasedCrawler.py
│       │   └── SummaryGenerator.py
│       ├── ui/
│       │   ├── chat_app.py
│       │   └── chat_ui.py
│       └── utils/
│           ├── content_validator.py
│           ├── data_manager.py
│           ├── helper_functions.py
│           ├── monitoring.py
│           └── progress.py
├── tests/
│   └── test_core/
│       ├── test_database_handler.py
│       ├── test_integration.py
│       ├── test_llm_based_crawler.py
│       └── test_summary_generator.py
├── .gitignore
├── LICENSE
├── README.md
├── Docs
├── pyproject.toml
├── pytest.ini
└── setup_env.py
```

## Core Components

### [LLMBasedCrawler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/LLMBasedCrawler.py) (src/crawlgpt/core/LLMBasedCrawler.py)

- Main crawler class handling web content extraction and processing
- Integrates with Groq API for language model operations
- Manages content chunking, summarization and response generation
- Includes rate limiting and metrics collection

### [DatabaseHandler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/DatabaseHandler.py) (src/crawlgpt/core/DatabaseHandler.py)

- Vector database implementation using FAISS
- Stores and retrieves text embeddings for efficient similarity search
- Handles data persistence and state management

### [SummaryGenerator](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/SummaryGenerator.py) (src/crawlgpt/core/SummaryGenerator.py)

- Generates concise summaries of text chunks using Groq API
- Configurable model selection and parameters
- Handles empty input validation

## UI Components

### [chat_app.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_app.py) (src/crawlgpt/ui/chat_app.py)

- Main Streamlit application interface
- URL processing and content extraction
- Chat interface with message history
- System metrics and debug information
- Import/export functionality

### [chat_ui.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_ui.py) (src/crawlgpt/ui/chat_ui.py)

- Development/testing UI with additional debug features
- Extended metrics visualization
- Raw data inspection capabilities

## Utilities

### [content_validator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/content_validator.py) (src/crawlgpt/utils/content_validator.py)

- URL and content validation
- MIME type checking
- Size limit enforcement
- Security checks for malicious content

### [data_manager.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/data_manager.py) (src/crawlgpt/utils/data_manager.py)

- Data import/export operations
- File serialization (JSON/pickle)
- Timestamped backups
- State management

### [monitoring.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/monitoring.py) (src/crawlgpt/utils/monitoring.py)

- Request metrics collection
- Rate limiting implementation
- Performance monitoring
- Usage statistics

### [progress.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/progress.py) (src/crawlgpt/utils/progress.py)

- Operation progress tracking
- Status updates
- Step counting
- Time tracking

## Testing

### [test_database_handler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_database_handler.py) (tests/test_core/test_database_handler.py)

- Tests for vector database operations
- Integration tests for data storage/retrieval
- End-to-end flow validation

### [test_integration.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_integration.py) (tests/test_core/test_integration.py)

- Full system integration tests
- URL extraction to response generation flow
- State management validation

### [test_llm_based_crawler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_llm_based_crawler.py) (tests/test_core/test_llm_based_crawler.py)

- Crawler functionality tests
- Content extraction validation
- Response generation testing

### [test_summary_generator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_summary_generator.py) (tests/test_core/test_summary_generator.py)

- Summary generation tests
- Empty input handling
- Model output validation

## Configuration

### [pyproject.toml](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pyproject.toml)

- Project metadata
- Dependencies
- Optional dev dependencies
- Entry points

### [pytest.ini](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pytest.ini)

- Test configuration
- Path settings
- Test discovery patterns
- Reporting options

### [setup_env.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/setup_env.py)

- Environment setup script
- Virtual environment creation
- Dependency installation
- Playwright setup

## Features

1. **Web Crawling**
   - Async web content extraction
   - Playwright-based rendering
   - Content validation
   - Rate limiting

2. **Content Processing**
   - Text chunking
   - Vector embeddings
   - Summarization
   - Similarity search

3. **Chat Interface**
   - Message history
   - Context management
   - Model parameter control
   - Debug information

4. **Data Management**
   - State import/export
   - Progress tracking
   - Metrics collection
   - Error handling

5. **Testing**
   - Unit tests
   - Integration tests
   - Mock implementations
   - Async test support

## Dependencies

Core:

- streamlit
- groq
- sentence-transformers
- faiss-cpu
- crawl4ai
- pydantic
- aiohttp
- beautifulsoup4
- playwright

Development:

- pytest
- pytest-mockito
- black
- isort
- flake8

## License

MIT License
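
## Illustrative Sketch: Chunking and Similarity Search

The content-processing flow described above (text chunking, vector embeddings, similarity search) can be sketched in a few lines of standard-library Python. This is not CrawlGPT's implementation: the names `chunk_text`, `embed`, `cosine`, and `ToyVectorStore` are hypothetical, the word-count "embedding" stands in for a sentence-transformers model, and the brute-force scan stands in for a FAISS index. It is only meant to show the shape of the idea under those assumptions.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; a real one would wrap a FAISS index."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        """Embed a chunk and keep it alongside its vector."""
        self._items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 3) -> list[tuple[str, float]]:
        """Return the k chunks most similar to the query, best first."""
        q = embed(query)
        scored = [(chunk, cosine(q, vec)) for chunk, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    store = ToyVectorStore()
    page = (
        "FAISS stores text embeddings for similarity search. "
        "Streamlit renders the chat interface with message history."
    )
    for piece in chunk_text(page, chunk_size=8, overlap=2):
        store.add(piece)
    for chunk, score in store.search("chat interface", k=2):
        print(f"{score:.3f}  {chunk}")
```

In a retrieval-augmented chat flow, the top-ranked chunks would then be passed to the language model as context for answering the user's question; the overlap between chunks reduces the chance that a sentence split across a boundary is lost to both neighbours.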