---
title: Legal Dashboard OCR System
sdk: docker
emoji: 🚀
colorFrom: indigo
colorTo: yellow
pinned: true
---

# Legal Dashboard OCR System

An AI-powered system for processing Persian legal documents, with OCR built on Hugging Face models.

## 🚀 Features

- **Advanced OCR Processing**: Hugging Face TrOCR models for Persian text extraction
- **AI-Powered Scoring**: Document quality assessment and scoring
- **Automatic Categorization**: AI-driven document category prediction
- **Real-time Dashboard**: Live analytics and document management
- **WebSocket Support**: Real-time updates and notifications
- **Comprehensive API**: RESTful API for all operations
- **Persian Language Support**: Optimized for Persian/Farsi legal documents

## 🏗️ Architecture

```
legal_dashboard_ocr/
├── app/                        # Backend application
│   ├── main.py                 # FastAPI entry point
│   ├── api/                    # API route handlers
│   │   ├── documents.py        # Document CRUD operations
│   │   ├── ocr.py              # OCR processing endpoints
│   │   └── dashboard.py        # Dashboard analytics
│   ├── services/               # Business logic services
│   │   ├── ocr_service.py      # OCR pipeline
│   │   ├── database_service.py # Database operations
│   │   └── ai_service.py       # AI scoring engine
│   └── models/                 # Data models
│       └── document_models.py
├── frontend/                   # Web interface
│   ├── improved_legal_dashboard.html
│   └── test_integration.html
├── tests/                      # Test suite
│   ├── test_api_endpoints.py
│   └── test_ocr_pipeline.py
├── data/                       # Sample documents
│   └── sample_persian.pdf
├── huggingface_space/          # HF Space deployment
│   ├── app.py                  # Gradio interface
│   ├── Spacefile               # Deployment config
│   └── README.md               # Space documentation
└── requirements.txt            # Dependencies
```

## 🛠️ Installation

### Prerequisites

- Python 3.10+
- pip
- Git

### Setup

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd legal_dashboard_ocr
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables**

   ```bash
   # Create .env file
   echo "HF_TOKEN=your_huggingface_token" > .env
   ```

4. **Run the application**

   ```bash
   # Start the FastAPI server
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   ```

5. **Access the application**
   - Web Dashboard: http://localhost:8000
   - API Documentation: http://localhost:8000/docs
   - Health Check: http://localhost:8000/health

## 📖 Usage

### Web Interface

1. **Upload PDF**: Navigate to the dashboard and upload a Persian legal document
2. **Process Document**: Click "Process PDF" to extract text using OCR
3. **Review Results**: View extracted text, AI analysis, and quality metrics
4. **Save Document**: Optionally save processed documents to the database
5. **View Analytics**: Check dashboard statistics and trends

### API Usage

#### Process PDF with OCR

```bash
curl -X POST "http://localhost:8000/api/ocr/process" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"
```

#### Get Documents

```bash
curl "http://localhost:8000/api/documents?limit=10&offset=0"
```

#### Create Document

```bash
curl -X POST "http://localhost:8000/api/documents/" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Legal Document",
    "full_text": "Extracted text content",
    "source": "Uploaded",
    "category": "قانون"
  }'
```

#### Get Dashboard Summary

```bash
curl "http://localhost:8000/api/dashboard/summary"
```

## 🔧 Configuration

### OCR Models

The system supports multiple Hugging Face OCR models:

- `microsoft/trocr-base-stage1`: Default model for printed text
- `microsoft/trocr-base-handwritten`: For handwritten text
- `microsoft/trocr-large-stage1`: Higher accuracy model

### AI Scoring Weights

The AI scoring engine uses configurable weights:

- Keyword Relevance: 30%
- Document Completeness: 25%
- Recency: 20%
- Source Credibility: 15%
- Document Quality: 10%
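
This README does not say how the engine combines these weights. As a minimal sketch, assuming each component is scored in the range 0 to 1 and the final score is a plain weighted sum, the calculation could look like the following. The names `SCORING_WEIGHTS` and `combine_scores` are hypothetical and are not taken from `ai_service.py`.

```python
import math

# Weights as listed in this README; the dictionary keys are hypothetical names.
SCORING_WEIGHTS = {
    "keyword_relevance": 0.30,
    "completeness": 0.25,
    "recency": 0.20,
    "source_credibility": 0.15,
    "quality": 0.10,
}


def combine_scores(components: dict[str, float],
                   weights: dict[str, float] = SCORING_WEIGHTS) -> float:
    """Return the weighted sum of per-component scores, each in [0, 1].

    Assumption: a linear combination. The real engine may use another rule.
    Components missing from `components` count as 0.0.
    """
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    total = 0.0
    for name, weight in weights.items():
        value = components.get(name, 0.0)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} must be within [0, 1]")
        total += weight * value
    return total
```

Under these assumptions, a document scoring 0.8, 0.6, 0.5, 1.0, and 0.9 on the five components in the order listed would get a final score of about 0.73. Checking that the weights sum to 1.0 means a change to one weight must be balanced by a change to another.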

### Database

SQLite database with tables for:

- Documents
- AI training data
- System metrics

## 🧪 Testing

### Run Tests

```bash
# Run all tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_api_endpoints.py

# Run with coverage
python -m pytest tests/ --cov=app
```

### Test Coverage

- API endpoint testing
- OCR pipeline validation
- Database operations
- AI scoring accuracy
- Frontend integration

## 🚀 Deployment

### Hugging Face Spaces

1. **Create a new Space** on Hugging Face
2. **Upload the project** files
3. **Set environment variables**:
   - `HF_TOKEN`: Your Hugging Face token
4. **Deploy** the Space

### Docker Deployment

```dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Production Deployment

1. **Set up a production server**
2. **Install dependencies**
3. **Configure environment variables**
4. **Set up reverse proxy** (nginx)
5. **Run with gunicorn**:

   ```bash
   gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
   ```

## 📊 API Documentation

### Endpoints

#### Documents

- `GET /api/documents/` - List documents
- `POST /api/documents/` - Create document
- `GET /api/documents/{id}` - Get document
- `PUT /api/documents/{id}` - Update document
- `DELETE /api/documents/{id}` - Delete document

#### OCR

- `POST /api/ocr/process` - Process PDF
- `POST /api/ocr/process-and-save` - Process and save
- `POST /api/ocr/batch-process` - Batch processing
- `GET /api/ocr/status` - OCR status

#### Dashboard

- `GET /api/dashboard/summary` - Dashboard summary
- `GET /api/dashboard/charts-data` - Chart data
- `GET /api/dashboard/ai-suggestions` - AI suggestions
- `POST /api/dashboard/ai-feedback` - Submit feedback

### Response Formats

All API responses follow a standard JSON format with:

- Success/error status
- Data payload
- Metadata (timestamps, pagination, etc.)
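
The exact field names of this envelope are not given here. The sketch below shows one way a status, payload, and metadata wrapper could be assembled. The helper `make_response` and the keys `status`, `data`, `error`, `metadata`, and `pagination` are illustrative assumptions, not the API's confirmed schema.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def make_response(
    data: Any = None,
    *,
    error: Optional[str] = None,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
    total: Optional[int] = None,
) -> dict:
    """Build a JSON-serializable envelope (hypothetical field names)."""
    # Metadata always carries a UTC timestamp in ISO 8601 form.
    metadata: dict[str, Any] = {
        "timestamp": datetime.now(timezone.utc).isoformat()
    }
    # Pagination details are attached only for paged list responses.
    if limit is not None and offset is not None:
        metadata["pagination"] = {"limit": limit, "offset": offset, "total": total}
    return {
        "status": "error" if error else "success",
        "data": None if error else data,
        "error": error,
        "metadata": metadata,
    }
```

A list endpoint could then return something like `make_response(documents, limit=10, offset=0, total=42)`, while a failed OCR request could return `make_response(error="Unsupported file type")`. Consult the generated docs at `/docs` for the actual response schemas.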

## 🔒 Security

### Authentication

- API key authentication for production
- Rate limiting on endpoints
- Input validation and sanitization

### Data Protection

- Secure file upload handling
- Temporary file cleanup
- Database connection security

## 🤝 Contributing

1. **Fork the repository**
2. **Create a feature branch**
3. **Make your changes**
4. **Add tests** for new functionality
5. **Submit a pull request**

### Development Guidelines

- Follow PEP 8 style guide
- Add type hints to functions
- Write comprehensive docstrings
- Include unit tests
- Update documentation

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Hugging Face for OCR models
- FastAPI for the web framework
- Gradio for the Space interface
- Microsoft for TrOCR models

## 📞 Support

For support and questions:

- Create an issue on GitHub
- Check the documentation
- Review the API docs at `/docs`

## 🔄 Changelog

### v1.0.0

- Initial release
- OCR pipeline with Hugging Face models
- AI scoring engine
- Dashboard interface
- RESTful API
- Hugging Face Space deployment