title: Legal Dashboard OCR System
sdk: docker
emoji: 🚀
colorFrom: indigo
colorTo: yellow
pinned: true

Legal Dashboard OCR System

AI-powered Persian legal document processing system with advanced OCR capabilities using Hugging Face models.

🚀 Features

  • Advanced OCR Processing: Hugging Face TrOCR models for Persian text extraction
  • AI-Powered Scoring: Intelligent document quality assessment and scoring
  • Automatic Categorization: AI-driven document category prediction
  • Real-time Dashboard: Live analytics and document management
  • WebSocket Support: Real-time updates and notifications
  • Comprehensive API: RESTful API for all operations
  • Persian Language Support: Optimized for Persian/Farsi legal documents

🏗️ Architecture

legal_dashboard_ocr/
├── app/                     # Backend application
│   ├── main.py             # FastAPI entry point
│   ├── api/                # API route handlers
│   │   ├── documents.py    # Document CRUD operations
│   │   ├── ocr.py         # OCR processing endpoints
│   │   └── dashboard.py   # Dashboard analytics
│   ├── services/           # Business logic services
│   │   ├── ocr_service.py # OCR pipeline
│   │   ├── database_service.py # Database operations
│   │   └── ai_service.py  # AI scoring engine
│   └── models/             # Data models
│       └── document_models.py
├── frontend/               # Web interface
│   ├── improved_legal_dashboard.html
│   └── test_integration.html
├── tests/                  # Test suite
│   ├── test_api_endpoints.py
│   └── test_ocr_pipeline.py
├── data/                   # Sample documents
│   └── sample_persian.pdf
├── huggingface_space/      # HF Space deployment
│   ├── app.py             # Gradio interface
│   ├── Spacefile          # Deployment config
│   └── README.md          # Space documentation
└── requirements.txt        # Dependencies

🛠️ Installation

Prerequisites

  • Python 3.10+
  • pip
  • Git

Setup

  1. Clone the repository

    git clone <repository-url>
    cd legal_dashboard_ocr
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Set up environment variables

    # Create .env file
    echo "HF_TOKEN=your_huggingface_token" > .env
    
  4. Run the application

    # Start the FastAPI server
    uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
    
  5. Access the application at http://localhost:8000 (interactive API docs at /docs)

📖 Usage

Web Interface

  1. Upload PDF: Navigate to the dashboard and upload a Persian legal document
  2. Process Document: Click "Process PDF" to extract text using OCR
  3. Review Results: View extracted text, AI analysis, and quality metrics
  4. Save Document: Optionally save processed documents to the database
  5. View Analytics: Check dashboard statistics and trends

API Usage

Process PDF with OCR

curl -X POST "http://localhost:8000/api/ocr/process" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@data/sample_persian.pdf"

Get Documents

curl "http://localhost:8000/api/documents/?limit=10&offset=0"

Create Document

curl -X POST "http://localhost:8000/api/documents/" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Legal Document",
    "full_text": "Extracted text content",
    "source": "Uploaded",
    "category": "قانون"
  }'
  }'

Get Dashboard Summary

curl "http://localhost:8000/api/dashboard/summary"
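
The listing and create calls can also be prepared from Python. The sketch below uses only the standard library and builds the requests without sending them. `BASE_URL` and the function names are illustrative assumptions, and the trailing slash follows the `/api/documents/` route listed under Endpoints.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed local development server


def list_documents_url(limit: int = 10, offset: int = 0) -> str:
    """Build the paginated URL for listing documents."""
    query = urlencode({"limit": limit, "offset": offset})
    return f"{BASE_URL}/api/documents/?{query}"


def create_document_request(document: dict) -> Request:
    """Build (but do not send) the POST request that creates a document."""
    # ensure_ascii=False keeps Persian text readable in the UTF-8 body.
    body = json.dumps(document, ensure_ascii=False).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/documents/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = create_document_request(
    {
        "title": "Legal Document",
        "full_text": "Extracted text content",
        "source": "Uploaded",
        "category": "قانون",
    }
)
# To send it against a running server: urllib.request.urlopen(request)
```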

🔧 Configuration

OCR Models

The system supports multiple Hugging Face OCR models:

  • microsoft/trocr-base-stage1: Default model for printed text
  • microsoft/trocr-base-handwritten: For handwritten text
  • microsoft/trocr-large-stage1: Higher accuracy model
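
As a rough illustration of how one of these checkpoints can be loaded, the sketch below follows the usual `transformers` pattern for TrOCR. It is not the project's `ocr_service.py`: the `OCR_MODELS` labels and both function names are hypothetical. TrOCR checkpoints are designed for images of a single text line, so a PDF page would first need rasterizing and line segmentation, which is not covered here.

```python
# Checkpoint names come from the list above; the dictionary keys are illustrative labels.
OCR_MODELS = {
    "default": "microsoft/trocr-base-stage1",
    "handwritten": "microsoft/trocr-base-handwritten",
    "large": "microsoft/trocr-large-stage1",
}


def resolve_model_name(kind: str = "default") -> str:
    """Map a short label to a checkpoint name, failing loudly on typos."""
    try:
        return OCR_MODELS[kind]
    except KeyError:
        raise ValueError(f"unknown OCR model kind: {kind!r}") from None


def recognize_text(image_path: str, kind: str = "default") -> str:
    """Run one text-line image through a TrOCR checkpoint and return the text."""
    # Imported lazily so the selection helper works without the heavy packages.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    name = resolve_model_name(kind)
    processor = TrOCRProcessor.from_pretrained(name)
    model = VisionEncoderDecoderModel.from_pretrained(name)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

In a long-running server the processor and model would normally be loaded once and reused rather than on every call.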

AI Scoring Weights

The AI scoring engine uses configurable weights:

  • Keyword Relevance: 30%
  • Document Completeness: 25%
  • Recency: 20%
  • Source Credibility: 15%
  • Document Quality: 10%
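
To make the weighting concrete, here is a minimal sketch of a weighted sum over the five criteria. It assumes each criterion is already scored in the range 0 to 1 and that the final score is reported on a 0 to 100 scale. Both are assumptions, as are the key names, and how `ai_service.py` derives each component is not shown here.

```python
# Weights taken from the list above; the key names are illustrative.
WEIGHTS = {
    "keyword_relevance": 0.30,
    "completeness": 0.25,
    "recency": 0.20,
    "source_credibility": 0.15,
    "quality": 0.10,
}


def weighted_score(components: dict[str, float]) -> float:
    """Combine per-criterion scores in [0, 1] into a single 0-100 score."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    # Clamp each component so a bad input cannot push the score out of range.
    total = sum(
        WEIGHTS[name] * min(max(components[name], 0.0), 1.0) for name in WEIGHTS
    )
    return round(100 * total, 2)


score = weighted_score(
    {
        "keyword_relevance": 0.8,
        "completeness": 0.6,
        "recency": 0.5,
        "source_credibility": 1.0,
        "quality": 0.4,
    }
)
# 0.30*0.8 + 0.25*0.6 + 0.20*0.5 + 0.15*1.0 + 0.10*0.4 = 0.68, so score is 68.0
```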

Database

SQLite database with tables for:

  • Documents
  • AI training data
  • System metrics
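
The actual schema lives in `database_service.py` and is not reproduced here. As a hedged illustration only, the three tables could look like the sketch below. The table names and every column are assumptions, except that the `documents` fields mirror the JSON body of the Create Document example.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative, not the project's.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      TEXT NOT NULL,
    full_text  TEXT NOT NULL,
    source     TEXT,
    category   TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS ai_training_data (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    feedback    TEXT
);
CREATE TABLE IF NOT EXISTS system_metrics (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    name        TEXT NOT NULL,
    value       REAL NOT NULL,
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


conn = init_db()
# Parameterized queries keep user-supplied text out of the SQL string.
conn.execute(
    "INSERT INTO documents (title, full_text, source, category) VALUES (?, ?, ?, ?)",
    ("Legal Document", "Extracted text content", "Uploaded", "قانون"),
)
conn.commit()
```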

🧪 Testing

Run Tests

# Run all tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_api_endpoints.py

# Run with coverage
python -m pytest tests/ --cov=app

Test Coverage

  • API endpoint testing
  • OCR pipeline validation
  • Database operations
  • AI scoring accuracy
  • Frontend integration

🚀 Deployment

Hugging Face Spaces

  1. Create a new Space on Hugging Face
  2. Upload the project files
  3. Set environment variables:
    • HF_TOKEN: Your Hugging Face token
  4. Deploy the Space

Docker Deployment

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Production Deployment

  1. Set up a production server
  2. Install dependencies
  3. Configure environment variables
  4. Set up reverse proxy (nginx)
  5. Run with gunicorn:
    gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
    

📊 API Documentation

Endpoints

Documents

  • GET /api/documents/ - List documents
  • POST /api/documents/ - Create document
  • GET /api/documents/{id} - Get document
  • PUT /api/documents/{id} - Update document
  • DELETE /api/documents/{id} - Delete document

OCR

  • POST /api/ocr/process - Process PDF
  • POST /api/ocr/process-and-save - Process and save
  • POST /api/ocr/batch-process - Batch processing
  • GET /api/ocr/status - OCR status

Dashboard

  • GET /api/dashboard/summary - Dashboard summary
  • GET /api/dashboard/charts-data - Chart data
  • GET /api/dashboard/ai-suggestions - AI suggestions
  • POST /api/dashboard/ai-feedback - Submit feedback

Response Formats

All API responses follow a standard JSON format with:

  • Success/error status
  • Data payload
  • Metadata (timestamps, pagination, etc.)

🔒 Security

Authentication

  • API key authentication for production
  • Rate limiting on endpoints
  • Input validation and sanitization

Data Protection

  • Secure file upload handling
  • Temporary file cleanup
  • Database connection security
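
The first two points can be illustrated with a short standard-library sketch: validate the upload before writing it, store it under a server-generated name, and remove it even if processing fails. The size limit and both helper names are assumptions, not values from the project.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # hypothetical limit, not from the project


def validate_pdf_upload(content: bytes) -> None:
    """Reject empty, oversized, or non-PDF payloads before touching the disk."""
    if not content:
        raise ValueError("empty upload")
    if len(content) > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds size limit")
    if not content.startswith(b"%PDF-"):
        raise ValueError("file does not look like a PDF")


@contextmanager
def temporary_pdf(content: bytes) -> Iterator[str]:
    """Write the upload to a private temp file and always delete it afterwards."""
    validate_pdf_upload(content)
    # mkstemp picks the file name, so the client-supplied name is never used.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)
```

A caller would wrap the OCR step in `with temporary_pdf(data) as path:` so the file is removed whether processing succeeds or raises.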

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

Development Guidelines

  • Follow PEP 8 style guide
  • Add type hints to functions
  • Write comprehensive docstrings
  • Include unit tests
  • Update documentation

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Hugging Face for OCR models
  • FastAPI for the web framework
  • Gradio for the Space interface
  • Microsoft for TrOCR models

📞 Support

For support and questions:

  • Create an issue on GitHub
  • Check the documentation
  • Review the API docs at /docs

🔄 Changelog

v1.0.0

  • Initial release
  • OCR pipeline with Hugging Face models
  • AI scoring engine
  • Dashboard interface
  • RESTful API
  • Hugging Face Space deployment