---
license: mit
title: CRAWLGPT
sdk: docker
emoji: 💻
colorFrom: pink
colorTo: blue
pinned: true
short_description: A powerful web content crawler with LLM-powered RAG.
---

# CrawlGPT 🤖

A powerful web content crawler with LLM-powered RAG (Retrieval Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology.

## 🌟 Key Features

### Core Features

- **Intelligent Web Crawling**
  - Async web content extraction using Playwright (see the sketch after this list)
  - Smart rate limiting and validation
  - Configurable crawling strategies
- **Advanced Content Processing**
  - Automatic text chunking and summarization
  - Vector embeddings via FAISS
  - Context-aware response generation
- **Streamlit Chat Interface**
  - Clean, responsive UI
  - Real-time content processing
  - Conversation history
  - User authentication
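
To make the crawling flow concrete, here is a minimal sketch of async text extraction with Playwright. This is not CrawlGPT's internal crawler code; the target URL and the `body` selector are illustrative assumptions.

```python
# Minimal async text extraction with Playwright (illustrative only;
# the URL and "body" selector are assumptions, not CrawlGPT internals).
import asyncio
from playwright.async_api import async_playwright

async def fetch_page_text(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()            # headless Chromium
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        text = await page.inner_text("body")           # visible page text
        await browser.close()
        return text

if __name__ == "__main__":
    print(asyncio.run(fetch_page_text("https://example.com"))[:500])
```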

### Technical Features

- **Vector Database**
  - FAISS-powered similarity search (see the sketch after this list)
  - Efficient content retrieval
  - Persistent storage
- **User Management**
  - SQLite database backend
  - Secure password hashing
  - Chat history tracking
- **Monitoring & Utils**
  - Request metrics collection
  - Progress tracking
  - Data import/export
  - Content validation
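
As a rough illustration of the vector-database feature, here is a minimal FAISS similarity search over embedded chunks; the embedding model and sample texts are assumptions, not the project's actual `DatabaseHandler` logic.

```python
# Minimal FAISS similarity-search sketch (model and texts are assumptions).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # assumed embedding model
chunks = ["FAISS indexes dense vectors.", "Streamlit builds data apps."]
embeddings = np.asarray(model.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])     # exact L2 search
index.add(embeddings)

query = np.asarray(model.encode(["What does FAISS do?"]), dtype="float32")
distances, ids = index.search(query, 1)            # top-1 nearest chunk
print(chunks[ids[0][0]])
```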

πŸŽ₯ Demo

Deployed App 🚀🤖

Demo video: `streamlit-chat_app video.webm` (an example of CRAWLGPT in action!)

πŸ”§ Requirements

- Python >= 3.8
- Operating System: OS Independent
- Required packages are handled by the setup script.

πŸš€ Quick Start

1. **Clone the Repository:**

   ```bash
   git clone <repository-url>  # replace with the CRAWLGPT repository URL
   cd CRAWLGPT
   ```

2. **Run the Setup Script:**

   ```bash
   python -m setup_env
   ```

   This script installs dependencies, creates a virtual environment, and prepares the project.

3. **Update Your Environment Variables** (a quick loading check appears after these steps):

   - Create or modify the `.env` file.
   - Add your Groq API key and Ollama API key. Learn how to get API keys.

   ```bash
   GROQ_API_KEY=your_groq_api_key_here
   OLLAMA_API_TOKEN=your_ollama_api_key_here
   ```

4. **Activate the Virtual Environment:**

   ```bash
   source .venv/bin/activate  # On Unix/macOS
   .venv\Scripts\activate     # On Windows
   ```

5. **Run the Application:**

   ```bash
   python -m streamlit run src/crawlgpt/ui/chat_app.py
   ```
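To confirm the keys from step 3 are picked up, `python-dotenv` (already a core dependency) can load them; this minimal check uses the exact variable names from the `.env` example above.

```python
# Quick sanity check: are the .env keys visible to Python?
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("GROQ_API_KEY set:", bool(os.getenv("GROQ_API_KEY")))
print("OLLAMA_API_TOKEN set:", bool(os.getenv("OLLAMA_API_TOKEN")))
```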

πŸ“¦ Dependencies

### Core Dependencies

- streamlit==1.41.1
- groq==0.15.0
- sentence-transformers==3.3.1
- faiss-cpu==1.9.0.post1
- crawl4ai==0.4.247
- python-dotenv==1.0.1
- pydantic==2.10.5
- aiohttp==3.11.11
- beautifulsoup4==4.12.3
- numpy==2.2.0
- tqdm==4.67.1
- playwright>=1.41.0
- asyncio>=3.4.3

### Development Dependencies

- pytest==8.3.4
- pytest-mockito==0.0.4
- black==24.2.0
- isort==5.13.0
- flake8==7.0.0
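
Since the app pins `groq==0.15.0`, a minimal chat-completion call with that SDK looks roughly like this; the model name is an assumption, not necessarily what CrawlGPT uses.

```python
# Minimal Groq chat completion (the model name is an assumption).
import os
from groq import Groq

client = Groq(api_key=os.getenv("GROQ_API_KEY"))
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # any Groq-hosted model works here
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(response.choices[0].message.content)
```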

πŸ—οΈ Project Structure

```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                           # Core functionality
│       │   ├── database.py                 # SQL database handling
│       │   ├── LLMBasedCrawler.py          # Main crawler implementation
│       │   ├── DatabaseHandler.py          # Vector database (FAISS)
│       │   └── SummaryGenerator.py         # Text summarization
│       ├── ui/                             # User Interface
│       │   ├── chat_app.py                 # Main Streamlit app
│       │   ├── chat_ui.py                  # Development UI
│       │   └── login.py                    # Authentication UI
│       └── utils/                          # Utilities
│           ├── content_validator.py        # URL/content validation
│           ├── data_manager.py             # Import/export handling
│           ├── helper_functions.py         # General helpers
│           ├── monitoring.py               # Metrics collection
│           └── progress.py                 # Progress tracking
├── tests/                                  # Test suite
│   └── test_core/
│       ├── test_database_handler.py        # Vector DB tests
│       ├── test_integration.py             # Integration tests
│       ├── test_llm_based_crawler.py       # Crawler tests
│       └── test_summary_generator.py       # Summarizer tests
├── .github/                                # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml                 # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                          # Documentation
├── .dockerignore                           # Docker exclusions
├── .gitignore                              # Git exclusions
├── Dockerfile                              # Container config
├── LICENSE                                 # MIT License
├── README.md                               # Project documentation
├── README_hf.md                            # HuggingFace README
├── pyproject.toml                          # Project metadata
├── pytest.ini                              # Test configuration
└── setup_env.py                            # Environment setup
```

πŸ§ͺ Testing

Run all tests:

```bash
python -m pytest
```

The tests include unit tests for core functionality and integration tests for end-to-end workflows.
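
Individual files can also be targeted directly, e.g. `python -m pytest tests/test_core/test_database_handler.py`. For orientation, a unit test in this suite might take the following shape; `InMemoryStore` is a toy stand-in written for this example, not the project's real `DatabaseHandler` API.

```python
# Hypothetical shape of a vector-store unit test; InMemoryStore is a
# toy stand-in for this example, not the real DatabaseHandler.
class InMemoryStore:
    def __init__(self):
        self.items = []

    def add(self, text: str) -> None:
        self.items.append(text)

    def search(self, query: str) -> list:
        return [t for t in self.items if query in t]

def test_added_content_is_retrievable():
    store = InMemoryStore()
    store.add("FAISS enables fast similarity search")
    assert store.search("FAISS") == ["FAISS enables fast similarity search"]
```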

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ”— Links

## 🧡 Acknowledgments

- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.

πŸ‘¨β€πŸ’» Author

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.

1. Fork the Project.
2. Create your Feature Branch:

   ```bash
   git checkout -b feature/AmazingFeature
   ```

3. Commit your Changes:

   ```bash
   git commit -m 'Add some AmazingFeature'
   ```

4. Push to the Branch:

   ```bash
   git push origin feature/AmazingFeature
   ```

5. Open a Pull Request.