---
title: Legal Dashboard OCR System
sdk: docker
colorFrom: indigo
colorTo: yellow
pinned: true
---
# Legal Dashboard OCR System

AI-powered Persian legal document processing system with advanced OCR capabilities using Hugging Face models.

## Features

- **Advanced OCR Processing**: Hugging Face TrOCR models for Persian text extraction
- **AI-Powered Scoring**: Intelligent document quality assessment and scoring
- **Automatic Categorization**: AI-driven document category prediction
- **Real-time Dashboard**: Live analytics and document management
- **WebSocket Support**: Real-time updates and notifications
- **Comprehensive API**: RESTful API for all operations
- **Persian Language Support**: Optimized for Persian/Farsi legal documents
## Architecture

```
legal_dashboard_ocr/
├── app/                          # Backend application
│   ├── main.py                   # FastAPI entry point
│   ├── api/                      # API route handlers
│   │   ├── documents.py          # Document CRUD operations
│   │   ├── ocr.py                # OCR processing endpoints
│   │   └── dashboard.py          # Dashboard analytics
│   ├── services/                 # Business logic services
│   │   ├── ocr_service.py        # OCR pipeline
│   │   ├── database_service.py   # Database operations
│   │   └── ai_service.py         # AI scoring engine
│   └── models/                   # Data models
│       └── document_models.py
├── frontend/                     # Web interface
│   ├── improved_legal_dashboard.html
│   └── test_integration.html
├── tests/                        # Test suite
│   ├── test_api_endpoints.py
│   └── test_ocr_pipeline.py
├── data/                         # Sample documents
│   └── sample_persian.pdf
├── huggingface_space/            # HF Space deployment
│   ├── app.py                    # Gradio interface
│   ├── Spacefile                 # Deployment config
│   └── README.md                 # Space documentation
└── requirements.txt              # Dependencies
```
## Installation

### Prerequisites

- Python 3.10+
- pip
- Git

### Setup

1. **Clone the repository**
   ```bash
   git clone <repository-url>
   cd legal_dashboard_ocr
   ```
2. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```
3. **Set up environment variables**
   ```bash
   # Create .env file
   echo "HF_TOKEN=your_huggingface_token" > .env
   ```
4. **Run the application**
   ```bash
   # Start the FastAPI server
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   ```
5. **Access the application**
   - Web Dashboard: http://localhost:8000
   - API Documentation: http://localhost:8000/docs
   - Health Check: http://localhost:8000/health
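
Step 3 writes `HF_TOKEN` to a `.env` file, but this README does not say how the application reads it. The sketch below is one possible loader using only the standard library; the function name `load_env_file` is hypothetical, and the project may well rely on a different mechanism (for example a dotenv package or the Space's secret settings).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Read KEY=VALUE lines from a file and export them to os.environ.

    Hypothetical helper: variables already set in the environment are
    left untouched, so real secrets take precedence over the file.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

Calling `load_env_file()` before the models are loaded would make the token available as `os.environ["HF_TOKEN"]`.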
## Usage

### Web Interface

1. **Upload PDF**: Navigate to the dashboard and upload a Persian legal document
2. **Process Document**: Click "Process PDF" to extract text using OCR
3. **Review Results**: View extracted text, AI analysis, and quality metrics
4. **Save Document**: Optionally save processed documents to the database
5. **View Analytics**: Check dashboard statistics and trends

### API Usage

#### Process PDF with OCR

```bash
curl -X POST "http://localhost:8000/api/ocr/process" \
  -H "Content-Type: multipart/form-data" \
  -F "[email protected]"
```

#### Get Documents

```bash
curl "http://localhost:8000/api/documents?limit=10&offset=0"
```

#### Create Document

```bash
curl -X POST "http://localhost:8000/api/documents/" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Legal Document",
    "full_text": "Extracted text content",
    "source": "Uploaded",
    "category": "قانون"
  }'
```

#### Get Dashboard Summary

```bash
curl "http://localhost:8000/api/dashboard/summary"
```
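
The same calls can be made from Python. The sketch below builds the "Get Documents" and "Create Document" requests with the standard library only; the helper names and the `BASE_URL` constant are illustrative and not part of the project, and the payload fields are copied from the curl example above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local development address, as used in the curl examples.
BASE_URL = "http://localhost:8000"


def build_list_documents_request(limit: int = 10, offset: int = 0) -> Request:
    """Build GET /api/documents with pagination query parameters."""
    query = urlencode({"limit": limit, "offset": offset})
    return Request(f"{BASE_URL}/api/documents?{query}", method="GET")


def build_create_document_request(document: dict) -> Request:
    """Build POST /api/documents/ with a UTF-8 encoded JSON body."""
    # ensure_ascii=False keeps Persian text readable in the payload.
    body = json.dumps(document, ensure_ascii=False).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/documents/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the server running, a request is sent with `urllib.request.urlopen(request)` and the body is decoded with `json.load(response)`.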
## Configuration

### OCR Models

The system supports multiple Hugging Face OCR models:

- `microsoft/trocr-base-stage1`: Default model for printed text
- `microsoft/trocr-base-handwritten`: For handwritten text
- `microsoft/trocr-large-stage1`: Higher accuracy model
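
As a rough illustration of how one of these models could be selected and run, the sketch below uses the `transformers` TrOCR classes. The `pick_model` and `recognize_line` helpers are hypothetical and are not taken from `ocr_service.py`; the sketch also assumes that `transformers`, `torch`, and `Pillow` are installed and that the input is an image of a single text line, which is the input TrOCR expects.

```python
# Model IDs copied from the list above.
DEFAULT_MODEL = "microsoft/trocr-base-stage1"
HANDWRITTEN_MODEL = "microsoft/trocr-base-handwritten"
LARGE_MODEL = "microsoft/trocr-large-stage1"


def pick_model(handwritten: bool = False, high_accuracy: bool = False) -> str:
    """Choose a model ID; the two flags are illustrative, not project settings."""
    if handwritten:
        return HANDWRITTEN_MODEL
    if high_accuracy:
        return LARGE_MODEL
    return DEFAULT_MODEL


def recognize_line(image_path: str, model_name: str = DEFAULT_MODEL) -> str:
    """Run TrOCR on an image of one text line and return the decoded string."""
    # Imported lazily so this module can be imported without the heavy
    # dependencies being installed.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(model_name)
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Loading the processor and model on every call is slow; a real service would more likely load them once at startup and reuse them.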
### AI Scoring Weights

The AI scoring engine uses configurable weights:

- Keyword Relevance: 30%
- Document Completeness: 25%
- Recency: 20%
- Source Credibility: 15%
- Document Quality: 10%
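
The weights sum to 100%, so a natural reading is a weighted average of per-criterion scores. The sketch below assumes each criterion is scored in the range 0 to 1 and that the final score is reported on a 0 to 100 scale; the key names and the `score_document` function are hypothetical, and the actual engine in `ai_service.py` may combine the criteria differently.

```python
# Weights from the list above, expressed as fractions of 1.
WEIGHTS = {
    "keyword_relevance": 0.30,
    "completeness": 0.25,
    "recency": 0.20,
    "source_credibility": 0.15,
    "quality": 0.10,
}


def score_document(components: dict) -> float:
    """Combine per-criterion scores (each in [0, 1]) into a 0-100 score.

    Criteria missing from `components` count as 0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components.get(name, 0.0)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        total += weight * value
    return round(100 * total, 2)


example = score_document({
    "keyword_relevance": 0.8,
    "completeness": 1.0,
    "recency": 0.5,
    "source_credibility": 0.6,
    "quality": 0.9,
})
# 0.30*0.8 + 0.25*1.0 + 0.20*0.5 + 0.15*0.6 + 0.10*0.9 = 0.77, so about 77 out of 100
```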
### Database

SQLite database with tables for:

- Documents
- AI training data
- System metrics
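
The README names the three tables but not their columns. The sketch below shows one way such a schema could be created with the standard `sqlite3` module; the table names, the column names, and `init_db` are assumptions made for illustration (the document columns mirror the fields in the "Create Document" request) and do not describe the actual schema in `database_service.py`.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    full_text TEXT,
    source TEXT,
    category TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS ai_training_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER REFERENCES documents(id),
    feedback TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS system_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    value REAL,
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Passing a file path instead of the default `":memory:"` persists the data between runs.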
## Testing

### Run Tests

```bash
# Run all tests
python -m pytest tests/
# Run specific test
python -m pytest tests/test_api_endpoints.py
# Run with coverage
python -m pytest tests/ --cov=app
```

### Test Coverage

- API endpoint testing
- OCR pipeline validation
- Database operations
- AI scoring accuracy
- Frontend integration
## Deployment

### Hugging Face Spaces

1. **Create a new Space** on Hugging Face
2. **Upload the project** files
3. **Set environment variables**:
   - `HF_TOKEN`: Your Hugging Face token
4. **Deploy** the Space

### Docker Deployment

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Production Deployment

1. **Set up a production server**
2. **Install dependencies**
3. **Configure environment variables**
4. **Set up reverse proxy** (nginx)
5. **Run with gunicorn**:
   ```bash
   gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
   ```
## API Documentation

### Endpoints

#### Documents

- `GET /api/documents/` - List documents
- `POST /api/documents/` - Create document
- `GET /api/documents/{id}` - Get document
- `PUT /api/documents/{id}` - Update document
- `DELETE /api/documents/{id}` - Delete document

#### OCR

- `POST /api/ocr/process` - Process PDF
- `POST /api/ocr/process-and-save` - Process and save
- `POST /api/ocr/batch-process` - Batch processing
- `GET /api/ocr/status` - OCR status

#### Dashboard

- `GET /api/dashboard/summary` - Dashboard summary
- `GET /api/dashboard/charts-data` - Chart data
- `GET /api/dashboard/ai-suggestions` - AI suggestions
- `POST /api/dashboard/ai-feedback` - Submit feedback

### Response Formats

All API responses are JSON objects that include:

- Success/error status
- Data payload
- Metadata (timestamps, pagination, etc.)
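
The exact field names are not given here, so the following is only a sketch of what such an envelope might look like. The keys `status`, `data`, `error`, and `metadata`, and the helper `make_response`, are assumptions; check the interactive docs at `/docs` for the real response schema.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def make_response(data: Any = None, error: Optional[str] = None, **metadata: Any) -> dict:
    """Wrap a payload in a status/data/metadata envelope (hypothetical shape)."""
    return {
        "status": "error" if error else "success",
        "data": data,
        "error": error,
        "metadata": {
            # UTC timestamp in ISO 8601 format.
            "timestamp": datetime.now(timezone.utc).isoformat(),
            **metadata,
        },
    }


# Example: a paginated document listing.
listing = make_response(data=[{"id": 1, "title": "Legal Document"}], limit=10, offset=0)
```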
## Security

### Authentication

- API key authentication for production
- Rate limiting on endpoints
- Input validation and sanitization
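
How these checks are implemented is not described in this README. As a minimal sketch under that caveat, the code below shows a constant-time API key comparison and a simple in-memory sliding-window rate limiter; `api_key_is_valid` and `SlidingWindowLimiter` are hypothetical names, and a multi-worker deployment would need shared state (for example in a database or cache) rather than a per-process dictionary.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Optional


def api_key_is_valid(provided: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The optional `now` argument exists only to make the limiter easy to test with fixed timestamps.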
### Data Protection

- Secure file upload handling
- Temporary file cleanup
- Database connection security
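
Temporary file cleanup can be made hard to forget by tying the file's lifetime to a context manager. The sketch below is one way to do this with the standard library; `temporary_upload` is a hypothetical helper and is not taken from the project's code.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_upload(data: bytes, suffix: str = ".pdf") -> Iterator[str]:
    """Write uploaded bytes to a temp file and delete it when the block exits."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so the file never outlives the request.
        if os.path.exists(path):
            os.remove(path)
```

Inside `with temporary_upload(pdf_bytes) as path:` the OCR step can read from `path`; once the block ends, the file is gone even if processing raised an exception.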
## Contributing

1. **Fork the repository**
2. **Create a feature branch**
3. **Make your changes**
4. **Add tests** for new functionality
5. **Submit a pull request**

### Development Guidelines

- Follow PEP 8 style guide
- Add type hints to functions
- Write comprehensive docstrings
- Include unit tests
- Update documentation
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Hugging Face for OCR models
- FastAPI for the web framework
- Gradio for the Space interface
- Microsoft for TrOCR models

## Support

For support and questions:

- Create an issue on GitHub
- Check the documentation
- Review the API docs at `/docs`

## Changelog

### v1.0.0

- Initial release
- OCR pipeline with Hugging Face models
- AI scoring engine
- Dashboard interface
- RESTful API
- Hugging Face Space deployment |