metadata

title: Legal Dashboard OCR System
sdk: docker
emoji: 🚀
colorFrom: indigo
colorTo: yellow
pinned: true

Legal Dashboard OCR System

An AI-powered system for processing Persian legal documents, with OCR built on Hugging Face models.

🚀 Features

Advanced OCR Processing: Hugging Face TrOCR models for Persian text extraction
AI-Powered Scoring: Weighted assessment of document relevance, completeness, and quality
Automatic Categorization: AI-driven document category prediction
Real-time Dashboard: Live analytics and document management
WebSocket Support: Real-time updates and notifications
Comprehensive API: RESTful API for all operations
Persian Language Support: Optimized for Persian/Farsi legal documents

🏗️ Architecture

legal_dashboard_ocr/
├── app/                     # Backend application
│   ├── main.py             # FastAPI entry point
│   ├── api/                # API route handlers
│   │   ├── documents.py    # Document CRUD operations
│   │   ├── ocr.py         # OCR processing endpoints
│   │   └── dashboard.py   # Dashboard analytics
│   ├── services/           # Business logic services
│   │   ├── ocr_service.py # OCR pipeline
│   │   ├── database_service.py # Database operations
│   │   └── ai_service.py  # AI scoring engine
│   └── models/             # Data models
│       └── document_models.py
├── frontend/               # Web interface
│   ├── improved_legal_dashboard.html
│   └── test_integration.html
├── tests/                  # Test suite
│   ├── test_api_endpoints.py
│   └── test_ocr_pipeline.py
├── data/                   # Sample documents
│   └── sample_persian.pdf
├── huggingface_space/      # HF Space deployment
│   ├── app.py             # Gradio interface
│   ├── Spacefile          # Deployment config
│   └── README.md          # Space documentation
└── requirements.txt        # Dependencies

🛠️ Installation

Prerequisites

Python 3.10+
pip
Git

Setup

Clone the repository

git clone <repository-url>
cd legal_dashboard_ocr

Install dependencies
```
pip install -r requirements.txt
```

Set up environment variables

# Create .env file
echo "HF_TOKEN=your_huggingface_token" > .env

Run the application

# Start the FastAPI server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Access the application
- Web Dashboard: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
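How the application reads HF_TOKEN is not shown in this README. As a minimal sketch, assuming the .env file holds simple KEY=value lines as created in the environment variables step, the token could be loaded with the standard library alone. The helpers load_env_file and get_hf_token are hypothetical names, not part of the project:

```python
import os
from pathlib import Path
from typing import Optional


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines; skip blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path: str = ".env") -> Optional[str]:
    """Prefer a real environment variable, then fall back to the .env file."""
    return os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
```

Checking the process environment first means a token set in the hosting platform's settings takes precedence over a local file.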

📖 Usage

Web Interface

Upload PDF: Navigate to the dashboard and upload a Persian legal document
Process Document: Click "Process PDF" to extract text using OCR
Review Results: View extracted text, AI analysis, and quality metrics
Save Document: Optionally save processed documents to the database
View Analytics: Check dashboard statistics and trends

API Usage

Process PDF with OCR

curl -X POST "http://localhost:8000/api/ocr/process" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"

Get Documents

curl "http://localhost:8000/api/documents/?limit=10&offset=0"

Create Document

curl -X POST "http://localhost:8000/api/documents/" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Legal Document",
    "full_text": "Extracted text content",
    "source": "Uploaded",
    "category": "قانون"
  }'

Get Dashboard Summary

curl "http://localhost:8000/api/dashboard/summary"
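The same calls can be made from Python. The following is a minimal sketch using only the standard library; it assumes the local server address from the Setup section and that responses are JSON objects. The helper names are hypothetical and not part of the project:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumed local server address


def list_documents_request(limit: int = 10, offset: int = 0) -> Request:
    """Build the GET request for one page of documents."""
    query = urlencode({"limit": limit, "offset": offset})
    return Request(f"{BASE_URL}/api/documents/?{query}", method="GET")


def create_document_request(title: str, full_text: str, source: str, category: str) -> Request:
    """Build the POST request with a UTF-8 JSON body; Persian text is sent unescaped."""
    payload = {"title": title, "full_text": full_text, "source": source, "category": category}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/documents/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: Request) -> dict:
    """Send a request and decode the JSON response; requires a running server."""
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the URL and payload logic testable without a running server, for example `send(list_documents_request(limit=5))` once the API is up.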

🔧 Configuration

OCR Models

The system supports multiple Hugging Face OCR models:

microsoft/trocr-base-stage1: Default model, used for printed text (stage-1 pre-trained checkpoint)
microsoft/trocr-base-handwritten: Fine-tuned for handwritten text
microsoft/trocr-large-stage1: Larger stage-1 checkpoint; higher accuracy at a higher compute cost
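How the project selects a model is not documented here. One way to sketch it, assuming a short configuration key per model, is a lookup table plus a lazy loader. The keys ("default", "handwritten", "large") and the function names are hypothetical; TrOCRProcessor and VisionEncoderDecoderModel are the Transformers classes used with TrOCR checkpoints:

```python
# Hypothetical short keys mapped to the model IDs listed above.
TROCR_MODELS = {
    "default": "microsoft/trocr-base-stage1",
    "handwritten": "microsoft/trocr-base-handwritten",
    "large": "microsoft/trocr-large-stage1",
}


def resolve_model_id(kind: str = "default") -> str:
    """Return the Hugging Face model ID for a configuration key."""
    try:
        return TROCR_MODELS[kind]
    except KeyError:
        raise ValueError(f"Unknown OCR model kind: {kind!r}") from None


def load_trocr(kind: str = "default"):
    """Load processor and model; downloads weights on first use."""
    # Imported lazily so that resolving IDs does not require transformers.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    model_id = resolve_model_id(kind)
    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    return processor, model
```

Keeping the import inside load_trocr lets configuration code validate a model key without loading the heavy dependency.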

AI Scoring Weights

The AI scoring engine uses configurable weights:

Keyword Relevance: 30%
Document Completeness: 25%
Recency: 20%
Source Credibility: 15%
Document Quality: 10%
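The weights above sum to 100%, so the final score can be read as a weighted average of the five criteria. A minimal sketch, assuming each criterion is already scored between 0 and 1 (the dictionary keys and the function name are hypothetical, and the project's actual formula may differ):

```python
# Weights in percent, taken from the list above; they sum to 100.
WEIGHTS = {
    "keyword_relevance": 30,
    "completeness": 25,
    "recency": 20,
    "source_credibility": 15,
    "quality": 10,
}


def weighted_score(components: dict) -> float:
    """Combine per-criterion scores in [0, 1] into one score in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"Missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Integer percent weights keep the arithmetic simple; divide once at the end.
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100
```

Under these assumptions, a document that scores 1.0 on keyword relevance and 0 on everything else receives 0.3, and a document that scores 1.0 on every criterion receives 1.0.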

Database

SQLite database with tables for:

Documents
AI training data
System metrics
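The actual schema is not listed in this README. As an illustration only, the three tables could be created as below; every column name is an assumption, with the documents columns mirroring the fields of the Create Document request:

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    full_text TEXT NOT NULL,
    source TEXT,
    category TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS ai_training_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER REFERENCES documents(id),
    feedback TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS system_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    value REAL,
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Calling `init_db()` with no argument creates an in-memory database, which is convenient for tests; passing a file path persists the data.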

🧪 Testing

Run Tests

# Run all tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_api_endpoints.py

# Run with coverage
python -m pytest tests/ --cov=app

Test Coverage

API endpoint testing
OCR pipeline validation
Database operations
AI scoring accuracy
Frontend integration

🚀 Deployment

Hugging Face Spaces

Create a new Space on Hugging Face
Upload the project files
Set environment variables:
- HF_TOKEN: Your Hugging Face token
Deploy the Space

Docker Deployment

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Production Deployment

Set up a production server
Install dependencies
Configure environment variables
Set up reverse proxy (nginx)

Run with gunicorn:

gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker

📊 API Documentation

Endpoints

Documents

GET /api/documents/ - List documents
POST /api/documents/ - Create document
GET /api/documents/{id} - Get document
PUT /api/documents/{id} - Update document
DELETE /api/documents/{id} - Delete document

OCR

POST /api/ocr/process - Process PDF
POST /api/ocr/process-and-save - Process and save
POST /api/ocr/batch-process - Batch processing
GET /api/ocr/status - OCR status

Dashboard

GET /api/dashboard/summary - Dashboard summary
GET /api/dashboard/charts-data - Chart data
GET /api/dashboard/ai-suggestions - AI suggestions
POST /api/dashboard/ai-feedback - Submit feedback

Response Formats

All API responses use a common JSON structure containing:

Success/error status
Data payload
Metadata (timestamps, pagination, etc.)
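The exact field names are not specified here. A response with the three parts listed above might be assembled as in this sketch, where "status", "data", "error", and "metadata" are assumed key names and make_response is a hypothetical helper:

```python
from datetime import datetime, timezone
from typing import Any, Optional


def make_response(data: Any = None, error: Optional[str] = None, **metadata: Any) -> dict:
    """Build a response with a status, a payload, and metadata (key names assumed)."""
    meta = {"timestamp": datetime.now(timezone.utc).isoformat()}
    meta.update(metadata)  # e.g. pagination values such as limit and offset
    return {
        "status": "error" if error else "success",
        "data": data,
        "error": error,
        "metadata": meta,
    }
```

For a paginated listing this could be called as `make_response(documents, limit=10, offset=0)`, and for a failure as `make_response(error="Document not found")`.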

🔒 Security

Authentication

API key authentication for production
Rate limiting on endpoints
Input validation and sanitization
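How API keys are checked is not described here. One common approach, sketched below under the assumption that the expected key is available to the server as a string, is a constant-time comparison with the standard library's hmac.compare_digest. The function name is hypothetical:

```python
import hmac
from typing import Optional


def is_valid_api_key(provided: Optional[str], expected: str) -> bool:
    """Compare keys in constant time so response timing does not leak a partial match."""
    if not provided or not expected:
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

A request handler would call this with the key taken from a request header and reject the request when it returns False.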

Data Protection

Secure file upload handling
Temporary file cleanup
Database connection security
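The first two points can be illustrated with a short sketch. It assumes uploads arrive as bytes, checks the "%PDF-" marker that a well-formed PDF begins with, and uses a hypothetical 20 MiB size limit; the function names are also hypothetical:

```python
import os
import tempfile
from contextlib import contextmanager

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # hypothetical 20 MiB limit


def validate_pdf_upload(data: bytes) -> None:
    """Reject uploads that are empty, too large, or lack the PDF header marker."""
    if not data:
        raise ValueError("Empty upload")
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("Upload exceeds size limit")
    if not data.startswith(b"%PDF-"):
        raise ValueError("File is not a PDF")


@contextmanager
def temporary_pdf(data: bytes):
    """Write a validated upload to a temp file and delete it when the block exits."""
    validate_pdf_upload(data)
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so no temporary file is left behind.
        if os.path.exists(path):
            os.remove(path)
```

An OCR step would run inside `with temporary_pdf(data) as path:`, so the file is removed even if processing raises an exception.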

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Submit a pull request

Development Guidelines

Follow PEP 8 style guide
Add type hints to functions
Write comprehensive docstrings
Include unit tests
Update documentation

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Hugging Face for OCR models
FastAPI for the web framework
Gradio for the Space interface
Microsoft for TrOCR models

📞 Support

For support and questions:

Create an issue on GitHub
Check the documentation
Review the API docs at /docs

🔄 Changelog

v1.0.0

Initial release
OCR pipeline with Hugging Face models
AI scoring engine
Dashboard interface
RESTful API
Hugging Face Space deployment