---
title: Legal Dashboard OCR System
sdk: docker
emoji: 🚀
colorFrom: indigo
colorTo: yellow
pinned: true
---
# Legal Dashboard OCR System
An AI-powered system for processing Persian legal documents, with OCR built on Hugging Face TrOCR models.
## 🚀 Features
- **Advanced OCR Processing**: Hugging Face TrOCR models for Persian text extraction
- **AI-Powered Scoring**: Intelligent document quality assessment and scoring
- **Automatic Categorization**: AI-driven document category prediction
- **Real-time Dashboard**: Live analytics and document management
- **WebSocket Support**: Real-time updates and notifications
- **Comprehensive API**: RESTful API for all operations
- **Persian Language Support**: Optimized for Persian/Farsi legal documents
## 🏗️ Architecture
```
legal_dashboard_ocr/
├── app/                        # Backend application
│   ├── main.py                 # FastAPI entry point
│   ├── api/                    # API route handlers
│   │   ├── documents.py        # Document CRUD operations
│   │   ├── ocr.py              # OCR processing endpoints
│   │   └── dashboard.py        # Dashboard analytics
│   ├── services/               # Business logic services
│   │   ├── ocr_service.py      # OCR pipeline
│   │   ├── database_service.py # Database operations
│   │   └── ai_service.py       # AI scoring engine
│   └── models/                 # Data models
│       └── document_models.py
├── frontend/                   # Web interface
│   ├── improved_legal_dashboard.html
│   └── test_integration.html
├── tests/                      # Test suite
│   ├── test_api_endpoints.py
│   └── test_ocr_pipeline.py
├── data/                       # Sample documents
│   └── sample_persian.pdf
├── huggingface_space/          # HF Space deployment
│   ├── app.py                  # Gradio interface
│   ├── Spacefile               # Deployment config
│   └── README.md               # Space documentation
└── requirements.txt            # Dependencies
```
## 🛠️ Installation
### Prerequisites
- Python 3.10+
- pip
- Git
### Setup
1. **Clone the repository**
```bash
git clone <repository-url>
cd legal_dashboard_ocr
```
2. **Install dependencies**
```bash
pip install -r requirements.txt
```
3. **Set up environment variables**
```bash
# Create .env file
echo "HF_TOKEN=your_huggingface_token" > .env
```
4. **Run the application**
```bash
# Start the FastAPI server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
5. **Access the application**
- Web Dashboard: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
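
Step 3 writes `HF_TOKEN` to a `.env` file, but the steps above do not say how the application reads it. The following is a minimal sketch, assuming a plain `KEY=value` file. `load_env_file` and `get_hf_token` are hypothetical helpers, not part of the project, and the real application may use a library such as python-dotenv instead.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=value lines (hypothetical helper, not project code)."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path: str = ".env") -> Optional[str]:
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
```

Reading the process environment first means a token set by the hosting platform takes precedence over a local file.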
## 📖 Usage
### Web Interface
1. **Upload PDF**: Navigate to the dashboard and upload a Persian legal document
2. **Process Document**: Click "Process PDF" to extract text using OCR
3. **Review Results**: View extracted text, AI analysis, and quality metrics
4. **Save Document**: Optionally save processed documents to the database
5. **View Analytics**: Check dashboard statistics and trends
### API Usage
#### Process PDF with OCR
```bash
# Assumption: the upload form field is named "file"; replace <path-to-pdf> with your document
curl -X POST "http://localhost:8000/api/ocr/process" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@<path-to-pdf>"
```
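
The same upload can be prepared from Python. This is a sketch using only the standard library; it assumes the endpoint reads the PDF from a form field named `file`, which the README does not confirm. `build_ocr_request` is a hypothetical helper that builds the request without sending it.

```python
import urllib.request
import uuid


def build_ocr_request(pdf_bytes: bytes, filename: str,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /api/ocr/process.

    Assumption: the server expects the upload in a form field named "file".
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/ocr/process",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With the server running, pass the result to `urllib.request.urlopen` to submit the document.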
#### Get Documents
```bash
curl "http://localhost:8000/api/documents/?limit=10&offset=0"
```
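
The `limit` and `offset` parameters suggest offset-based pagination. A sketch of walking all pages follows. It assumes each page comes back as a list of documents, which may not match the real response shape, and `fetch_page` is an injected callable so the loop can be tried without a running server. Both function names are hypothetical.

```python
from typing import Callable, Iterator, List
from urllib.parse import urlencode


def documents_url(limit: int, offset: int,
                  base_url: str = "http://localhost:8000") -> str:
    """Build the list URL for one page."""
    query = urlencode({"limit": limit, "offset": offset})
    return f"{base_url}/api/documents/?{query}"


def iter_documents(fetch_page: Callable[[int, int], List[dict]],
                   limit: int = 10) -> Iterator[dict]:
    """Yield documents page by page until a short or empty page is returned."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break
        offset += limit
```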
#### Create Document
```bash
curl -X POST "http://localhost:8000/api/documents/" \
-H "Content-Type: application/json" \
  -d '{
    "title": "Legal Document",
    "full_text": "Extracted text content",
    "source": "Uploaded",
    "category": "قانون"
  }'
```
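
Category values can be Persian strings such as `قانون` ("law"), so the JSON body should be sent as UTF-8. A minimal sketch of building the same request in Python; `build_create_request` is a hypothetical helper and the field names are taken from the curl example above.

```python
import json
import urllib.request


def build_create_request(title: str, full_text: str, source: str, category: str,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a JSON POST for /api/documents/."""
    payload = {
        "title": title,
        "full_text": full_text,
        "source": source,
        "category": category,
    }
    # ensure_ascii=False keeps Persian text readable instead of \uXXXX escapes.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/documents/",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```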
#### Get Dashboard Summary
```bash
curl "http://localhost:8000/api/dashboard/summary"
```
## 🔧 Configuration
### OCR Models
The system supports multiple Hugging Face TrOCR checkpoints:
- `microsoft/trocr-base-stage1`: Default model for printed text (a pre-trained-only checkpoint)
- `microsoft/trocr-base-handwritten`: For handwritten text
- `microsoft/trocr-large-stage1`: Higher accuracy model (larger pre-trained-only checkpoint)
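
How the project chooses between these checkpoints is not shown here. The sketch below assumes a simple lookup keyed by document kind; the keys and both function names are hypothetical. `recognize_line` follows the standard Transformers usage for TrOCR, which expects an image of a single text line, and it needs `transformers` and `torch` installed.

```python
TROCR_MODELS = {
    "printed": "microsoft/trocr-base-stage1",
    "handwritten": "microsoft/trocr-base-handwritten",
    "high_accuracy": "microsoft/trocr-large-stage1",
}


def select_model(kind: str = "printed") -> str:
    """Return the checkpoint name for a document kind (keys are hypothetical)."""
    try:
        return TROCR_MODELS[kind]
    except KeyError:
        raise ValueError(f"unknown model kind: {kind!r}") from None


def recognize_line(image, kind: str = "printed") -> str:
    """Run TrOCR on one PIL image of a text line."""
    # Imported lazily so the lookup above works without the heavy dependencies.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    name = select_model(kind)
    processor = TrOCRProcessor.from_pretrained(name)
    model = VisionEncoderDecoderModel.from_pretrained(name)
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

In a long-running service the processor and model would normally be loaded once and reused, rather than on every call as in this sketch.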
### AI Scoring Weights
The AI scoring engine uses configurable weights:
- Keyword Relevance: 30%
- Document Completeness: 25%
- Recency: 20%
- Source Credibility: 15%
- Document Quality: 10%
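
These weights sum to 100%, so one natural reading is a weighted average of per-criterion scores. The sketch below assumes each criterion is scored in the range 0 to 1; the key names and `weighted_score` are illustrative and may differ from what `ai_service.py` actually does.

```python
import math

# Weights from the list above, expressed as fractions of 1.0.
SCORING_WEIGHTS = {
    "keyword_relevance": 0.30,
    "completeness": 0.25,
    "recency": 0.20,
    "source_credibility": 0.15,
    "quality": 0.10,
}
assert math.isclose(sum(SCORING_WEIGHTS.values()), 1.0)


def weighted_score(components: dict) -> float:
    """Combine per-criterion scores in [0, 1] into one score in [0, 1]."""
    missing = set(SCORING_WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(SCORING_WEIGHTS[name] * components[name] for name in SCORING_WEIGHTS)
```

For example, scores of 0.8, 0.6, 0.5, 1.0 and 0.4 (in the order listed) combine to 0.24 + 0.15 + 0.10 + 0.15 + 0.04 = 0.68.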
### Database
SQLite database with tables for:
- Documents
- AI training data
- System metrics
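
The actual schema is not documented here. As an illustration only, the three tables might look like the following; every table and column name is an assumption, with the `documents` columns borrowed from the fields in the Create Document example.

```python
import sqlite3

# Hypothetical schema; the project's real tables may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    full_text TEXT,
    source TEXT,
    category TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS ai_training_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER REFERENCES documents(id),
    feedback TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS system_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    value REAL,
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```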
## 🧪 Testing
### Run Tests
```bash
# Run all tests
python -m pytest tests/
# Run specific test
python -m pytest tests/test_api_endpoints.py
# Run with coverage
python -m pytest tests/ --cov=app
```
### Test Coverage
- API endpoint testing
- OCR pipeline validation
- Database operations
- AI scoring accuracy
- Frontend integration
## 🚀 Deployment
### Hugging Face Spaces
1. **Create a new Space** on Hugging Face
2. **Upload the project** files
3. **Set environment variables**:
- `HF_TOKEN`: Your Hugging Face token
4. **Deploy** the Space
### Docker Deployment
```dockerfile
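FROM python:3.10-slim
WORKDIR /app
# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application source
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]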
```
### Production Deployment
1. **Set up a production server**
2. **Install dependencies**
3. **Configure environment variables**
4. **Set up reverse proxy** (nginx)
5. **Run with gunicorn**:
```bash
gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
```
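
The same options can live in a Gunicorn configuration file, which is plain Python. This is a sketch, not the project's configuration: the bind address assumes nginx proxies to the local port, and the timeout is an assumed value chosen because OCR requests can run longer than Gunicorn's 30-second default.

```python
# gunicorn.conf.py, loaded with: gunicorn -c gunicorn.conf.py app.main:app
bind = "127.0.0.1:8000"  # assumption: nginx forwards requests to this address
workers = 4  # same as -w 4 in the command above
worker_class = "uvicorn.workers.UvicornWorker"  # same as -k in the command above
timeout = 120  # assumption: allow long OCR requests; tune to your workload
accesslog = "-"  # write the access log to stdout
```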
## 📊 API Documentation
### Endpoints
#### Documents
- `GET /api/documents/` - List documents
- `POST /api/documents/` - Create document
- `GET /api/documents/{id}` - Get document
- `PUT /api/documents/{id}` - Update document
- `DELETE /api/documents/{id}` - Delete document
#### OCR
- `POST /api/ocr/process` - Process PDF
- `POST /api/ocr/process-and-save` - Process and save
- `POST /api/ocr/batch-process` - Batch processing
- `GET /api/ocr/status` - OCR status
#### Dashboard
- `GET /api/dashboard/summary` - Dashboard summary
- `GET /api/dashboard/charts-data` - Chart data
- `GET /api/dashboard/ai-suggestions` - AI suggestions
- `POST /api/dashboard/ai-feedback` - Submit feedback
### Response Formats
All API responses are JSON objects that include:
- Success/error status
- Data payload
- Metadata (timestamps, pagination, etc.)
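
The exact field names are not specified, so the following is only one way such an envelope could look. `ApiResponse`, `ok`, `fail`, and the keys `success`, `data`, `error`, and `metadata` are all assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class ApiResponse:
    """Hypothetical response envelope: status, payload, and metadata."""
    success: bool
    data: Any = None
    error: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def ok(data: Any, **metadata: Any) -> dict:
    """Wrap a payload in a success envelope with a UTC timestamp."""
    metadata.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    return asdict(ApiResponse(success=True, data=data, metadata=metadata))


def fail(message: str) -> dict:
    """Wrap an error message in a failure envelope."""
    metadata = {"timestamp": datetime.now(timezone.utc).isoformat()}
    return asdict(ApiResponse(success=False, error=message, metadata=metadata))
```

A paginated list could then be returned as `ok(documents, limit=10, offset=0)`.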
## 🔒 Security
### Authentication
- API key authentication for production
- Rate limiting on endpoints
- Input validation and sanitization
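
How these controls are implemented is not described here. As a sketch under that caveat, an API key check can use a constant-time comparison, and rate limiting can be approximated with a per-client sliding window. `api_key_is_valid` and `SlidingWindowLimiter` are hypothetical names, and the limiter keeps state in memory, so it only covers a single process.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Optional


def api_key_is_valid(presented: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```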
### Data Protection
- Secure file upload handling
- Temporary file cleanup
- Database connection security
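
For upload handling and temporary file cleanup, one common pattern is a context manager that validates the upload, writes it under a server-chosen name, and always deletes it afterwards. The sketch below is illustrative: `temporary_pdf` is a hypothetical helper, and the size limit and the `%PDF-` header check are assumptions, not the project's documented rules.

```python
import os
import tempfile
from contextlib import contextmanager

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumed limit, not taken from the project


@contextmanager
def temporary_pdf(content: bytes):
    """Write an upload to a temp file and delete it when the block exits."""
    if len(content) > MAX_UPLOAD_BYTES:
        raise ValueError("upload too large")
    if not content.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")
    # The server picks the file name, so a client-supplied name never reaches the disk path.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        yield path
    finally:
        os.remove(path)
```

The OCR step would run inside the `with temporary_pdf(data) as path:` block, so the file is removed even if processing raises an error.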
## 🀝 Contributing
1. **Fork the repository**
2. **Create a feature branch**
3. **Make your changes**
4. **Add tests** for new functionality
5. **Submit a pull request**
### Development Guidelines
- Follow PEP 8 style guide
- Add type hints to functions
- Write comprehensive docstrings
- Include unit tests
- Update documentation
## 📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Hugging Face for OCR models
- FastAPI for the web framework
- Gradio for the Space interface
- Microsoft for TrOCR models
## 📞 Support
For support and questions:
- Create an issue on GitHub
- Check the documentation
- Review the API docs at `/docs`
## 🔄 Changelog
### v1.0.0
- Initial release
- OCR pipeline with Hugging Face models
- AI scoring engine
- Dashboard interface
- RESTful API
- Hugging Face Space deployment