---
license: mit
title: CRAWLGPT
sdk: docker
emoji: 💻
colorFrom: pink
colorTo: blue
pinned: true
short_description: A powerful web content crawler with LLM-powered RAG.
---

# CrawlGPT 🤖

A powerful web content crawler with LLM-powered RAG (Retrieval-Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology.

## 🌟 Key Features

### Core Features

- **Intelligent Web Crawling**
  - Async web content extraction using Playwright
  - Smart rate limiting and validation
  - Configurable crawling strategies
- **Advanced Content Processing**
  - Automatic text chunking and summarization
  - Vector embeddings via FAISS
  - Context-aware response generation
- **Streamlit Chat Interface**
  - Clean, responsive UI
  - Real-time content processing
  - Conversation history
  - User authentication

### Technical Features

- **Vector Database**
  - FAISS-powered similarity search
  - Efficient content retrieval
  - Persistent storage
- **User Management**
  - SQLite database backend
  - Secure password hashing
  - Chat history tracking
- **Monitoring & Utils**
  - Request metrics collection
  - Progress tracking
  - Data import/export
  - Content validation

## 🎥 Demo

### [Deployed App 🚀🤖](https://huggingface.co/spaces/jatinmehra/CRAWL-GPT-CHAT)

[streamlit-chat_app video.webm](https://github.com/user-attachments/assets/ae1ddca0-9e3e-4b00-bf21-e73bb8e6cfdf)

_Example of CRAWLGPT in action!_

## 🔧 Requirements

- Python >= 3.8
- Operating system: OS independent
- Required packages are handled by the setup script.

## 🚀 Quick Start

1. Clone the repository:

   ```
   git clone https://github.com/Jatin-Mehra119/CRAWLGPT.git
   cd CRAWLGPT
   ```

2. Run the setup script:

   ```
   python -m setup_env
   ```

   _This script installs dependencies, creates a virtual environment, and prepares the project._

3. Update your environment variables:
   - Create or modify the `.env` file.
   - Add your Groq API key and Ollama API key (see each provider's documentation for how to obtain one):

   ```
   GROQ_API_KEY=your_groq_api_key_here
   OLLAMA_API_TOKEN=your_ollama_api_key_here
   ```

4. Activate the virtual environment:

   ```
   source .venv/bin/activate  # On Unix/macOS
   .venv\Scripts\activate     # On Windows
   ```

5. Run the application:

   ```
   python -m streamlit run src/crawlgpt/ui/chat_app.py
   ```

## 📦 Dependencies

### Core Dependencies

- `streamlit==1.41.1`
- `groq==0.15.0`
- `sentence-transformers==3.3.1`
- `faiss-cpu==1.9.0.post1`
- `crawl4ai==0.4.247`
- `python-dotenv==1.0.1`
- `pydantic==2.10.5`
- `aiohttp==3.11.11`
- `beautifulsoup4==4.12.3`
- `numpy==2.2.0`
- `tqdm==4.67.1`
- `playwright>=1.41.0`
- `asyncio>=3.4.3`

### Development Dependencies

- `pytest==8.3.4`
- `pytest-mockito==0.0.4`
- `black==24.2.0`
- `isort==5.13.0`
- `flake8==7.0.0`

## 🏗️ Project Structure

```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                      # Core functionality
│       │   ├── database.py            # SQL database handling
│       │   ├── LLMBasedCrawler.py     # Main crawler implementation
│       │   ├── DatabaseHandler.py     # Vector database (FAISS)
│       │   └── SummaryGenerator.py    # Text summarization
│       ├── ui/                        # User interface
│       │   ├── chat_app.py            # Main Streamlit app
│       │   ├── chat_ui.py             # Development UI
│       │   └── login.py               # Authentication UI
│       └── utils/                     # Utilities
│           ├── content_validator.py   # URL/content validation
│           ├── data_manager.py        # Import/export handling
│           ├── helper_functions.py    # General helpers
│           ├── monitoring.py          # Metrics collection
│           └── progress.py            # Progress tracking
├── tests/                             # Test suite
│   └── test_core/
│       ├── test_database_handler.py   # Vector DB tests
│       ├── test_integration.py        # Integration tests
│       ├── test_llm_based_crawler.py  # Crawler tests
│       └── test_summary_generator.py  # Summarizer tests
├── .github/                           # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml            # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                     # Documentation
├── .dockerignore                      # Docker exclusions
├── .gitignore                         # Git exclusions
├── Dockerfile                         # Container config
├── LICENSE                            # MIT License
├── README.md                          # Project documentation
├── README_hf.md                       # HuggingFace README
├── pyproject.toml                     # Project metadata
├── pytest.ini                         # Test configuration
└── setup_env.py                       # Environment setup
```

## 🧪 Testing

Run all tests:

```
python -m pytest
```

_The tests include unit tests for core functionality and integration tests for end-to-end workflows._

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Links

- [Bug Tracker](https://github.com/Jatin-Mehra119/crawlgpt/issues)
- [Documentation](https://github.com/Jatin-Mehra119/crawlgpt/wiki)
- [Source Code](https://github.com/Jatin-Mehra119/crawlgpt)

## 🧡 Acknowledgments

- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4AI, Groq, FAISS, and Playwright for their powerful tools.

## 👨‍💻 Author

- Jatin Mehra (jatinmehra@outlook.in)

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a pull request. For major changes, open an issue first to discuss your proposal.

1. Fork the project.
2. Create your feature branch:

   ```
   git checkout -b feature/AmazingFeature
   ```

3. Commit your changes:

   ```
   git commit -m 'Add some AmazingFeature'
   ```

4. Push to the branch:

   ```
   git push origin feature/AmazingFeature
   ```

5. Open a pull request.
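## Appendix: Verifying Your Environment

A missing `GROQ_API_KEY` or `OLLAMA_API_TOKEN` typically only surfaces once the Streamlit app makes its first LLM call. The small sketch below is a hypothetical helper (not part of the repository) that checks the keys from the Quick Start `.env` example up front; it uses only the standard library:

```python
import os

# Key names taken from the .env example in the Quick Start section.
REQUIRED_KEYS = ["GROQ_API_KEY", "OLLAMA_API_TOKEN"]


def missing_keys(env=os.environ):
    """Return the names of required keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("Environment OK")
```

Run it inside the activated virtual environment before launching the app; if it reports missing keys, revisit step 3 of the Quick Start.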