---
license: mit
title: CRAWLGPT
sdk: docker
emoji: 💻
colorFrom: pink
colorTo: blue
pinned: true
short_description: A powerful web content crawler with LLM-powered RAG.
---
# CrawlGPT 🤖
A powerful web content crawler with LLM-powered RAG (Retrieval Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology.
## 🚀 Key Features

### Core Features
- **Intelligent Web Crawling**
  - Async web content extraction using Playwright (see the sketch after this list)
  - Smart rate limiting and validation
  - Configurable crawling strategies
- **Advanced Content Processing**
  - Automatic text chunking and summarization
  - Vector embeddings via FAISS
  - Context-aware response generation
- **Streamlit Chat Interface**
  - Clean, responsive UI
  - Real-time content processing
  - Conversation history
  - User authentication
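Since the crawling layer builds on Playwright's async API, here is a minimal sketch of that extraction pattern. The function name and the whole-page `inner_text` grab are illustrative assumptions, not CrawlGPT's actual internals:

```python
# Minimal sketch of async page extraction with Playwright.
# extract_text is an illustrative name, not CrawlGPT's real API.
import asyncio
from playwright.async_api import async_playwright

async def extract_text(url: str) -> str:
    """Fetch a page in a headless browser and return its visible text."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        text = await page.inner_text("body")  # crude whole-page text grab
        await browser.close()
        return text

if __name__ == "__main__":
    print(asyncio.run(extract_text("https://example.com"))[:500])
```

Running the sketch requires the browser binaries, installed once via `playwright install chromium`.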
### Technical Features

- **Vector Database**
  - FAISS-powered similarity search (see the sketch after this list)
  - Efficient content retrieval
  - Persistent storage
- **User Management**
  - SQLite database backend
  - Secure password hashing
  - Chat history tracking
- **Monitoring & Utils**
  - Request metrics collection
  - Progress tracking
  - Data import/export
  - Content validation
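To make the retrieval step concrete, here is a hedged sketch of the embed-index-search loop using sentence-transformers and FAISS (both project dependencies). The embedding model and the exact-L2 index type are assumptions for illustration, not necessarily CrawlGPT's configuration:

```python
# Sketch of embedding chunks, indexing them in FAISS, and querying.
# The model name and IndexFlatL2 are assumptions, not CrawlGPT's config.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
chunks = [
    "FAISS performs fast similarity search over dense vectors.",
    "Playwright drives a headless browser for content extraction.",
]

embeddings = model.encode(chunks)                # (n_chunks, dim) float32 array
index = faiss.IndexFlatL2(embeddings.shape[1])   # exact L2-distance index
index.add(embeddings)

query = model.encode(["How is similar content found?"])
distances, ids = index.search(query, 1)          # top-1 nearest chunk per query
print(chunks[ids[0][0]])
```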
## 🎥 Demo

Deployed APP 🚀🤖
Example of CRAWLGPT in action!
## 🔧 Requirements
- Python >= 3.8
- Operating System: OS Independent
- Required packages are handled by the setup script.
## 🚀 Quick Start
1. **Clone the Repository:**

   ```bash
   cd CRAWLGPT
   ```

2. **Run the Setup Script:**

   ```bash
   python -m setup_env
   ```

   This script installs dependencies, creates a virtual environment, and prepares the project.
3. **Update Your Environment Variables:**

   - Create or modify the `.env` file.
   - Add your Groq API key and Ollama API key. Learn how to get API keys.

   ```bash
   GROQ_API_KEY=your_groq_api_key_here
   OLLAMA_API_TOKEN=your_ollama_api_key_here
   ```

   A sketch of how these keys are read at runtime follows the steps below.
4. **Activate the Virtual Environment:**

   ```bash
   source .venv/bin/activate  # On Unix/macOS
   .venv\Scripts\activate     # On Windows
   ```
5. **Run the Application:**

   ```bash
   python -m streamlit run src/crawlgpt/ui/chat_app.py
   ```
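For reference, here is a hedged sketch of how the keys in `.env` are typically consumed with python-dotenv and the groq client, both listed dependencies. The model name is a placeholder, not necessarily what CrawlGPT uses:

```python
# Sketch: load .env keys and issue one Groq chat completion.
# The model name is a placeholder assumption.
import os
from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads GROQ_API_KEY / OLLAMA_API_TOKEN into the environment

client = Groq(api_key=os.getenv("GROQ_API_KEY"))
reply = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # placeholder; any Groq-hosted model works
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)
```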
## 📦 Dependencies

### Core Dependencies

```
streamlit==1.41.1
groq==0.15.0
sentence-transformers==3.3.1
faiss-cpu==1.9.0.post1
crawl4ai==0.4.247
python-dotenv==1.0.1
pydantic==2.10.5
aiohttp==3.11.11
beautifulsoup4==4.12.3
numpy==2.2.0
tqdm==4.67.1
playwright>=1.41.0
asyncio>=3.4.3
```
### Development Dependencies

```
pytest==8.3.4
pytest-mockito==0.0.4
black==24.2.0
isort==5.13.0
flake8==7.0.0
```
## 🏗️ Project Structure

```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                        # Core functionality
│       │   ├── database.py              # SQL database handling
│       │   ├── LLMBasedCrawler.py       # Main crawler implementation
│       │   ├── DatabaseHandler.py       # Vector database (FAISS)
│       │   └── SummaryGenerator.py      # Text summarization
│       ├── ui/                          # User Interface
│       │   ├── chat_app.py              # Main Streamlit app
│       │   ├── chat_ui.py               # Development UI
│       │   └── login.py                 # Authentication UI
│       └── utils/                       # Utilities
│           ├── content_validator.py     # URL/content validation
│           ├── data_manager.py          # Import/export handling
│           ├── helper_functions.py      # General helpers
│           ├── monitoring.py            # Metrics collection
│           └── progress.py              # Progress tracking
├── tests/                               # Test suite
│   └── test_core/
│       ├── test_database_handler.py     # Vector DB tests
│       ├── test_integration.py          # Integration tests
│       ├── test_llm_based_crawler.py    # Crawler tests
│       └── test_summary_generator.py    # Summarizer tests
├── .github/                             # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml              # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                       # Documentation
├── .dockerignore                        # Docker exclusions
├── .gitignore                           # Git exclusions
├── Dockerfile                           # Container config
├── LICENSE                              # MIT License
├── README.md                            # Project documentation
├── README_hf.md                         # HuggingFace README
├── pyproject.toml                       # Project metadata
├── pytest.ini                           # Test configuration
└── setup_env.py                         # Environment setup
```
## 🧪 Testing

```bash
# Run all tests
python -m pytest
```
The tests include unit tests for core functionality and integration tests for end-to-end workflows.
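For flavor, here is a minimal hedged example in the suite's pytest style; `validate_url` is a hypothetical stand-in, not necessarily the real `content_validator` API:

```python
# Hypothetical unit-test sketch; validate_url is an assumed helper,
# not the actual crawlgpt.utils.content_validator interface.
import pytest

def validate_url(url: str) -> bool:
    """Stand-in for URL validation logic."""
    return url.startswith(("http://", "https://"))

@pytest.mark.parametrize("url,expected", [
    ("https://example.com", True),
    ("ftp://example.com", False),
    ("not a url", False),
])
def test_validate_url(url, expected):
    assert validate_url(url) is expected
```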
## 📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🔗 Links
## 🧡 Acknowledgments
- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.
## 👨‍💻 Author
- Jatin Mehra ([email protected])
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.
- Fork the Project.
- Create your Feature Branch:
  ```bash
  git checkout -b feature/AmazingFeature
  ```
- Commit your Changes:
  ```bash
  git commit -m 'Add some AmazingFeature'
  ```
- Push to the Branch:
  ```bash
  git push origin feature/AmazingFeature
  ```
- Open a Pull Request.