
	---
	title: Legal Dashboard OCR System
	sdk: docker
	emoji: 🚀
	colorFrom: indigo
	colorTo: yellow
	pinned: true
	---

	# Legal Dashboard OCR System

	AI-powered Persian legal document processing system with advanced OCR capabilities using Hugging Face models.

	## 🚀 Features

	- Advanced OCR Processing: Hugging Face TrOCR models for Persian text extraction
	- AI-Powered Scoring: Intelligent document quality assessment and scoring
	- Automatic Categorization: AI-driven document category prediction
	- Real-time Dashboard: Live analytics and document management
	- WebSocket Support: Real-time updates and notifications
	- Comprehensive API: RESTful API for all operations
	- Persian Language Support: Optimized for Persian/Farsi legal documents

	## 🏗️ Architecture

	```
	legal_dashboard_ocr/
	├── app/                          # Backend application
	│   ├── main.py                   # FastAPI entry point
	│   ├── api/                      # API route handlers
	│   │   ├── documents.py          # Document CRUD operations
	│   │   ├── ocr.py                # OCR processing endpoints
	│   │   └── dashboard.py          # Dashboard analytics
	│   ├── services/                 # Business logic services
	│   │   ├── ocr_service.py        # OCR pipeline
	│   │   ├── database_service.py   # Database operations
	│   │   └── ai_service.py         # AI scoring engine
	│   └── models/                   # Data models
	│       └── document_models.py
	├── frontend/                     # Web interface
	│   ├── improved_legal_dashboard.html
	│   └── test_integration.html
	├── tests/                        # Test suite
	│   ├── test_api_endpoints.py
	│   └── test_ocr_pipeline.py
	├── data/                         # Sample documents
	│   └── sample_persian.pdf
	├── huggingface_space/            # HF Space deployment
	│   ├── app.py                    # Gradio interface
	│   ├── Spacefile                 # Deployment config
	│   └── README.md                 # Space documentation
	└── requirements.txt              # Dependencies
	```

	## 🛠️ Installation

	### Prerequisites

	- Python 3.10+
	- pip
	- Git

	### Setup

	1. Clone the repository
	```bash
	git clone <repository-url>
	cd legal_dashboard_ocr
	```

	2. Install dependencies
	```bash
	pip install -r requirements.txt
	```

	3. Set up environment variables
	```bash
	# Create .env file
	echo "HF_TOKEN=your_huggingface_token" > .env
	```

	4. Run the application
	```bash
	# Start the FastAPI server
	uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
	```

	5. Access the application
	- Web Dashboard: http://localhost:8000
	- API Documentation: http://localhost:8000/docs
	- Health Check: http://localhost:8000/health
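
To confirm that the server has started, the health endpoint can be polled from a script. The following is a minimal sketch using only the standard library. The `fetch_status` and `wait_for_health` helpers and their retry parameters are illustrative and not part of the project, and the sketch assumes that `/health` answers with HTTP 200 once the service is ready.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_for_health(
    url: str = "http://localhost:8000/health",
    attempts: int = 10,
    delay: float = 1.0,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """Poll the health endpoint until it returns 200 or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_for_health()` right after starting `uvicorn` returns `True` as soon as the endpoint responds, which is handy in deployment scripts.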

	## 📖 Usage

	### Web Interface

	1. Upload PDF: Navigate to the dashboard and upload a Persian legal document
	2. Process Document: Click "Process PDF" to extract text using OCR
	3. Review Results: View extracted text, AI analysis, and quality metrics
	4. Save Document: Optionally save processed documents to the database
	5. View Analytics: Check dashboard statistics and trends

	### API Usage

	#### Process PDF with OCR
	```bash
	# The form field name "file" is assumed here; adjust it to match the endpoint's parameter
	curl -X POST "http://localhost:8000/api/ocr/process" \
	-H "Content-Type: multipart/form-data" \
	-F "file=@document.pdf"
	```
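
The same upload can be prepared from Python. This sketch builds the multipart request with the standard library only. The `build_ocr_request` helper is hypothetical, and the form field name `file` is an assumption that may need to match the endpoint's actual parameter name.

```python
import uuid
import urllib.request


def build_ocr_request(
    pdf_bytes: bytes,
    filename: str = "document.pdf",
    base_url: str = "http://localhost:8000",
    field_name: str = "file",  # assumed form field name
) -> urllib.request.Request:
    """Build a multipart/form-data POST request for /api/ocr/process."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/ocr/process",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With the server running, the request can be sent with `urllib.request.urlopen(build_ocr_request(open("document.pdf", "rb").read()))`.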

	#### Get Documents
	```bash
	curl "http://localhost:8000/api/documents/?limit=10&offset=0"
	```
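
When paging through many documents, the `limit` and `offset` query parameters can be generated programmatically. A small sketch follows. The helper names are illustrative, and the path uses the trailing-slash form `/api/documents/` from the endpoint list.

```python
from urllib.parse import urlencode


def documents_url(
    limit: int = 10,
    offset: int = 0,
    base_url: str = "http://localhost:8000",
) -> str:
    """Build the list-documents URL for one page."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    query = urlencode({"limit": limit, "offset": offset})
    return f"{base_url}/api/documents/?{query}"


def page_urls(total: int, limit: int = 10) -> list[str]:
    """Return one URL per page needed to cover `total` documents."""
    return [documents_url(limit, offset) for offset in range(0, total, limit)]
```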

	#### Create Document
	```bash
	curl -X POST "http://localhost:8000/api/documents/" \
	-H "Content-Type: application/json" \
	-d '{
	"title": "Legal Document",
	"full_text": "Extracted text content",
	"source": "Uploaded",
	"category": "قانون"
	}'
	```
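
The equivalent request in Python needs to keep the Persian category intact. This sketch encodes the payload as UTF-8 JSON with the standard library. The `build_create_request` helper is hypothetical, and the field names are taken from the curl example above.

```python
import json
import urllib.request


def build_create_request(
    title: str,
    full_text: str,
    source: str,
    category: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a JSON POST request for /api/documents/."""
    payload = {
        "title": title,
        "full_text": full_text,
        "source": source,
        "category": category,
    }
    # ensure_ascii=False keeps Persian text readable instead of \uXXXX escapes.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/documents/",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```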

	#### Get Dashboard Summary
	```bash
	curl "http://localhost:8000/api/dashboard/summary"
	```

	## 🔧 Configuration

	### OCR Models

	The system supports multiple Hugging Face OCR models:

	- `microsoft/trocr-base-stage1`: Default model for printed text (a pre-trained-only checkpoint)
	- `microsoft/trocr-base-handwritten`: For handwritten text
	- `microsoft/trocr-large-stage1`: Larger pre-trained checkpoint for higher accuracy
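
As a rough illustration of how one of these checkpoints could be selected and used, the sketch below follows the usual `transformers` pattern for TrOCR. The `OCR_MODELS` mapping, its keys, and the helper functions are assumptions made for this example and do not describe how `ocr_service.py` is actually written. The heavy imports are deferred so the module can be loaded without `transformers` installed.

```python
OCR_MODELS = {
    "default": "microsoft/trocr-base-stage1",
    "handwritten": "microsoft/trocr-base-handwritten",
    "large": "microsoft/trocr-large-stage1",
}


def resolve_model_name(kind: str = "default") -> str:
    """Map a short (hypothetical) key to a Hugging Face model id."""
    try:
        return OCR_MODELS[kind]
    except KeyError:
        options = ", ".join(sorted(OCR_MODELS))
        raise ValueError(f"Unknown OCR model '{kind}'; choose one of: {options}") from None


def load_ocr_model(kind: str = "default"):
    """Load the processor and model; requires transformers and torch."""
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    name = resolve_model_name(kind)
    processor = TrOCRProcessor.from_pretrained(name)
    model = VisionEncoderDecoderModel.from_pretrained(name)
    return processor, model


def recognize_line(image, processor, model) -> str:
    """Run OCR on a single RGB PIL image containing one line of text."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

TrOCR works on single text lines, so a full PDF page would first have to be rendered to an image and split into line crops before calling `recognize_line`.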

	### AI Scoring Weights

	The AI scoring engine uses configurable weights:

	- Keyword Relevance: 30%
	- Document Completeness: 25%
	- Recency: 20%
	- Source Credibility: 15%
	- Document Quality: 10%
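
The weights above can be combined into a single score as a weighted sum. The sketch below shows one way to do that. The component keys, the 0 to 1 input range, and the 0 to 100 output scale are assumptions for illustration, not the project's actual scoring code.

```python
SCORING_WEIGHTS = {
    "keyword_relevance": 0.30,
    "completeness": 0.25,
    "recency": 0.20,
    "source_credibility": 0.15,
    "quality": 0.10,
}


def score_document(
    components: dict[str, float],
    weights: dict[str, float] = SCORING_WEIGHTS,
) -> float:
    """Combine per-criterion scores (each in [0, 1]) into a 0-100 score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    total = 0.0
    for name, weight in weights.items():
        value = components.get(name, 0.0)  # a missing criterion counts as 0
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
        total += weight * value
    return round(100 * total, 2)
```

For example, component scores of 0.8, 0.6, 0.5, 1.0, and 0.4 (in the order listed above) give 0.24 + 0.15 + 0.10 + 0.15 + 0.04 = 0.68, or 68 on the 0 to 100 scale.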

	### Database

	SQLite database with tables for:
	- Documents
	- AI training data
	- System metrics
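
For orientation, a schema along these lines could back the three tables. This is only a sketch: the table and column names are guesses based on the API examples in this README, and the real definitions live in `database_service.py`.

```python
import sqlite3

# Hypothetical schema; column names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    full_text TEXT NOT NULL,
    source TEXT,
    category TEXT,
    ai_score REAL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS ai_training_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER REFERENCES documents(id),
    feedback TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS system_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    value REAL,
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

SQLite stores text as Unicode, so Persian values such as category names need no special handling.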

	## 🧪 Testing

	### Run Tests
	```bash
	# Run all tests
	python -m pytest tests/

	# Run specific test
	python -m pytest tests/test_api_endpoints.py

	# Run with coverage
	python -m pytest tests/ --cov=app
	```

	### Test Coverage
	- API endpoint testing
	- OCR pipeline validation
	- Database operations
	- AI scoring accuracy
	- Frontend integration

	## 🚀 Deployment

	### Hugging Face Spaces

	1. Create a new Space on Hugging Face
	2. Upload the project files
	3. Set environment variables:
	   - `HF_TOKEN`: Your Hugging Face token
	4. Deploy the Space

	### Docker Deployment

	```dockerfile
	FROM python:3.10-slim

	WORKDIR /app
	COPY requirements.txt .
	RUN pip install -r requirements.txt

	COPY . .
	EXPOSE 8000

	CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
	```

	### Production Deployment

	1. Set up a production server
	2. Install dependencies
	3. Configure environment variables
	4. Set up reverse proxy (nginx)
	5. Run with gunicorn:
	```bash
	gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
	```
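
The same options can be kept in a Gunicorn configuration file, which is itself a Python module. This is a sketch of what such a file might contain. The bind address mirrors the port used elsewhere in this README, while the timeout value is an assumption chosen because OCR requests can be slow.

```python
# gunicorn.conf.py: load with `gunicorn -c gunicorn.conf.py app.main:app`

# Address and port to listen on (matches the uvicorn examples).
bind = "0.0.0.0:8000"

# Number of worker processes, same as `-w 4`.
workers = 4

# ASGI worker class, same as `-k uvicorn.workers.UvicornWorker`.
worker_class = "uvicorn.workers.UvicornWorker"

# Seconds before an unresponsive worker is restarted (assumed value).
timeout = 120

# Write the access log to stdout so the process manager can collect it.
accesslog = "-"
```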

	## 📊 API Documentation

	### Endpoints

	#### Documents
	- `GET /api/documents/` - List documents
	- `POST /api/documents/` - Create document
	- `GET /api/documents/{id}` - Get document
	- `PUT /api/documents/{id}` - Update document
	- `DELETE /api/documents/{id}` - Delete document

	#### OCR
	- `POST /api/ocr/process` - Process PDF
	- `POST /api/ocr/process-and-save` - Process and save
	- `POST /api/ocr/batch-process` - Batch processing
	- `GET /api/ocr/status` - OCR status

	#### Dashboard
	- `GET /api/dashboard/summary` - Dashboard summary
	- `GET /api/dashboard/charts-data` - Chart data
	- `GET /api/dashboard/ai-suggestions` - AI suggestions
	- `POST /api/dashboard/ai-feedback` - Submit feedback

	### Response Formats

	All API responses are JSON objects that contain:
	- Success/error status
	- Data payload
	- Metadata (timestamps, pagination, etc.)
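
A response envelope with these three parts might look like the sketch below. The field names (`success`, `data`, `error`, `metadata`) and the `paginated` helper are assumptions for illustration. Check the live schema at `/docs` for the actual shape.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class ApiResponse:
    """Hypothetical envelope: status flag, payload, and metadata."""

    success: bool
    data: Any = None
    error: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # ensure_ascii=False keeps Persian text readable in the output.
        return json.dumps(asdict(self), ensure_ascii=False)


def paginated(items: list, total: int, limit: int, offset: int) -> ApiResponse:
    """Wrap one page of results with a timestamp and pagination info."""
    return ApiResponse(
        success=True,
        data=items,
        metadata={
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "pagination": {"total": total, "limit": limit, "offset": offset},
        },
    )
```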

	## 🔒 Security

	### Authentication
	- API key authentication for production
	- Rate limiting on endpoints
	- Input validation and sanitization
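
To make the first two points concrete, here is a minimal, framework-independent sketch of an API key check and a per-client rate limiter. The helper names, the sliding-window approach, and the limits are assumptions for this example. The project's own implementation may differ.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Optional


def is_valid_api_key(provided: str, expected: str) -> bool:
    """Compare keys in constant time to avoid timing side channels."""
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: dict = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a FastAPI application, checks like these would typically run in a dependency or middleware before the route handler is called.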

	### Data Protection
	- Secure file upload handling
	- Temporary file cleanup
	- Database connection security
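
Upload handling and temporary file cleanup can be combined in a context manager so that the file is removed even if OCR fails. The sketch below is an illustration under assumptions: the helper names and the 10 MB size limit are hypothetical, and the only check performed is the `%PDF-` signature at the start of the file.

```python
import os
import tempfile
from contextlib import contextmanager

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit of 10 MB


def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> bool:
    """Accept only non-oversized payloads that start with the PDF signature."""
    return len(data) <= max_bytes and data.startswith(b"%PDF-")


@contextmanager
def temporary_upload(data: bytes, suffix: str = ".pdf"):
    """Write the upload to a temp file and delete it when the block exits."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on exceptions alike.
        if os.path.exists(path):
            os.remove(path)
```

A handler would call `validate_upload` first and then run the OCR step inside `with temporary_upload(data) as path:`.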

	## 🤝 Contributing

	1. Fork the repository
	2. Create a feature branch
	3. Make your changes
	4. Add tests for new functionality
	5. Submit a pull request

	### Development Guidelines

	- Follow PEP 8 style guide
	- Add type hints to functions
	- Write comprehensive docstrings
	- Include unit tests
	- Update documentation

	## 📝 License

	This project is licensed under the MIT License - see the LICENSE file for details.

	## 🙏 Acknowledgments

	- Hugging Face for OCR models
	- FastAPI for the web framework
	- Gradio for the Space interface
	- Microsoft for TrOCR models

	## 📞 Support

	For support and questions:
	- Create an issue on GitHub
	- Check the documentation
	- Review the API docs at `/docs`

	## 🔄 Changelog

	### v1.0.0
	- Initial release
	- OCR pipeline with Hugging Face models
	- AI scoring engine
	- Dashboard interface
	- RESTful API
	- Hugging Face Space deployment