---
license: mit
title: CRAWLGPT
sdk: docker
emoji: 💻
colorFrom: pink
colorTo: blue
pinned: true
short_description: A powerful web content crawler with LLM-powered RAG.
---

# CrawlGPT 🤖

A powerful web content crawler with LLM-powered RAG (Retrieval Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology.

## 🌟 Key Features

### Core Features

- **Intelligent Web Crawling**
  - Async web content extraction using Playwright (see the sketch after this list)
  - Smart rate limiting and validation
  - Configurable crawling strategies
- **Advanced Content Processing**
  - Automatic text chunking and summarization
  - Vector embeddings via FAISS
  - Context-aware response generation
- **Streamlit Chat Interface**
  - Clean, responsive UI
  - Real-time content processing
  - Conversation history
  - User authentication
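
To make the crawling flow concrete, here is a minimal sketch of async text extraction with Playwright. This is not CrawlGPT's internal crawler code; the target URL and the `body` selector are illustrative assumptions.

```python
# Minimal async text extraction with Playwright (illustrative only;
# the URL and "body" selector are assumptions, not CrawlGPT internals).
import asyncio
from playwright.async_api import async_playwright

async def fetch_page_text(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()            # headless Chromium
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        text = await page.inner_text("body")           # visible page text
        await browser.close()
        return text

if __name__ == "__main__":
    print(asyncio.run(fetch_page_text("https://example.com"))[:500])
```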

### Technical Features

- **Vector Database**
  - FAISS-powered similarity search (see the sketch after this list)
  - Efficient content retrieval
  - Persistent storage
- **User Management**
  - SQLite database backend
  - Secure password hashing
  - Chat history tracking
- **Monitoring & Utils**
  - Request metrics collection
  - Progress tracking
  - Data import/export
  - Content validation
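
As a rough illustration of the vector-database feature, here is a minimal FAISS similarity search over embedded chunks; the embedding model and sample texts are assumptions, not the project's actual `DatabaseHandler` logic.

```python
# Minimal FAISS similarity-search sketch (model and texts are assumptions).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # assumed embedding model
chunks = ["FAISS indexes dense vectors.", "Streamlit builds data apps."]
embeddings = np.asarray(model.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])     # exact L2 search
index.add(embeddings)

query = np.asarray(model.encode(["What does FAISS do?"]), dtype="float32")
distances, ids = index.search(query, 1)            # top-1 nearest chunk
print(chunks[ids[0][0]])
```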

πŸŽ₯ Demo

Deployed App 🚀🤖

Demo video: `streamlit-chat_app video.webm` (an example of CRAWLGPT in action!)

πŸ”§ Requirements

- Python >= 3.8
- Operating System: OS Independent
- Required packages are handled by the setup script.

πŸš€ Quick Start

1. **Clone the Repository:**

   ```bash
   git clone <repository-url>  # replace with the CRAWLGPT repository URL
   cd CRAWLGPT
   ```

2. **Run the Setup Script:**

   ```bash
   python -m setup_env
   ```

   This script installs dependencies, creates a virtual environment, and prepares the project.

3. **Update Your Environment Variables** (a quick loading check appears after these steps):

   - Create or modify the `.env` file.
   - Add your Groq API key and Ollama API key. Learn how to get API keys.

   ```bash
   GROQ_API_KEY=your_groq_api_key_here
   OLLAMA_API_TOKEN=your_ollama_api_key_here
   ```

4. **Activate the Virtual Environment:**

   ```bash
   source .venv/bin/activate  # On Unix/macOS
   .venv\Scripts\activate     # On Windows
   ```

5. **Run the Application:**

   ```bash
   python -m streamlit run src/crawlgpt/ui/chat_app.py
   ```
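To confirm the keys from step 3 are picked up, `python-dotenv` (already a core dependency) can load them; this minimal check uses the exact variable names from the `.env` example above.

```python
# Quick sanity check: are the .env keys visible to Python?
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("GROQ_API_KEY set:", bool(os.getenv("GROQ_API_KEY")))
print("OLLAMA_API_TOKEN set:", bool(os.getenv("OLLAMA_API_TOKEN")))
```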

πŸ“¦ Dependencies

### Core Dependencies

- streamlit==1.41.1
- groq==0.15.0
- sentence-transformers==3.3.1
- faiss-cpu==1.9.0.post1
- crawl4ai==0.4.247
- python-dotenv==1.0.1
- pydantic==2.10.5
- aiohttp==3.11.11
- beautifulsoup4==4.12.3
- numpy==2.2.0
- tqdm==4.67.1
- playwright>=1.41.0
- asyncio>=3.4.3

### Development Dependencies

- pytest==8.3.4
- pytest-mockito==0.0.4
- black==24.2.0
- isort==5.13.0
- flake8==7.0.0
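
Since the app pins `groq==0.15.0`, a minimal chat-completion call with that SDK looks roughly like this; the model name is an assumption, not necessarily what CrawlGPT uses.

```python
# Minimal Groq chat completion (the model name is an assumption).
import os
from groq import Groq

client = Groq(api_key=os.getenv("GROQ_API_KEY"))
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # any Groq-hosted model works here
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(response.choices[0].message.content)
```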

πŸ—οΈ Project Structure

```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                           # Core functionality
│       │   ├── database.py                 # SQL database handling
│       │   ├── LLMBasedCrawler.py          # Main crawler implementation
│       │   ├── DatabaseHandler.py          # Vector database (FAISS)
│       │   └── SummaryGenerator.py         # Text summarization
│       ├── ui/                             # User Interface
│       │   ├── chat_app.py                 # Main Streamlit app
│       │   ├── chat_ui.py                  # Development UI
│       │   └── login.py                    # Authentication UI
│       └── utils/                          # Utilities
│           ├── content_validator.py        # URL/content validation
│           ├── data_manager.py             # Import/export handling
│           ├── helper_functions.py         # General helpers
│           ├── monitoring.py               # Metrics collection
│           └── progress.py                 # Progress tracking
├── tests/                                  # Test suite
│   └── test_core/
│       ├── test_database_handler.py        # Vector DB tests
│       ├── test_integration.py             # Integration tests
│       ├── test_llm_based_crawler.py       # Crawler tests
│       └── test_summary_generator.py       # Summarizer tests
├── .github/                                # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml                 # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                          # Documentation
├── .dockerignore                           # Docker exclusions
├── .gitignore                              # Git exclusions
├── Dockerfile                              # Container config
├── LICENSE                                 # MIT License
├── README.md                               # Project documentation
├── README_hf.md                            # HuggingFace README
├── pyproject.toml                          # Project metadata
├── pytest.ini                              # Test configuration
└── setup_env.py                            # Environment setup
```

πŸ§ͺ Testing

Run all tests:

```bash
python -m pytest
```

The tests include unit tests for core functionality and integration tests for end-to-end workflows.
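
Individual files can also be targeted directly, e.g. `python -m pytest tests/test_core/test_database_handler.py`. For orientation, a unit test in this suite might take the following shape; `InMemoryStore` is a toy stand-in written for this example, not the project's real `DatabaseHandler` API.

```python
# Hypothetical shape of a vector-store unit test; InMemoryStore is a
# toy stand-in for this example, not the real DatabaseHandler.
class InMemoryStore:
    def __init__(self):
        self.items = []

    def add(self, text: str) -> None:
        self.items.append(text)

    def search(self, query: str) -> list:
        return [t for t in self.items if query in t]

def test_added_content_is_retrievable():
    store = InMemoryStore()
    store.add("FAISS enables fast similarity search")
    assert store.search("FAISS") == ["FAISS enables fast similarity search"]
```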

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ”— Links

## 🧡 Acknowledgments

- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.

πŸ‘¨β€πŸ’» Author

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.

1. Fork the Project.
2. Create your Feature Branch:

   ```bash
   git checkout -b feature/AmazingFeature
   ```

3. Commit your Changes:

   ```bash
   git commit -m 'Add some AmazingFeature'
   ```

4. Push to the Branch:

   ```bash
   git push origin feature/AmazingFeature
   ```

5. Open a Pull Request.