Legal Dashboard - Scraping & Rating System Documentation
Overview
The Legal Dashboard Scraping & Rating System is a comprehensive web scraping and data quality evaluation platform designed specifically for legal document processing. The system provides advanced scraping capabilities with multiple strategies, intelligent data rating, and a modern web dashboard for monitoring and control.
Features
🕷️ Advanced Web Scraping
- Multiple Scraping Strategies: General, Legal Documents, News Articles, Academic Papers, Government Sites, Custom
- Async Processing: High-performance asynchronous scraping with configurable delays
- Content Extraction: Intelligent content extraction based on strategy and page structure
- Error Handling: Comprehensive error handling and logging
- Rate Limiting: Built-in rate limiting to respect website policies
⭐ Intelligent Data Rating
- Multi-Criteria Evaluation: Source credibility, content completeness, OCR accuracy, data freshness, content relevance, technical quality
- Dynamic Scoring: Real-time rating updates as data is processed
- Quality Indicators: Automatic detection of legal document patterns and quality markers
- Confidence Scoring: Statistical confidence levels for rating accuracy
📊 Real-Time Dashboard
- Live Monitoring: Real-time job progress and system statistics
- Interactive Charts: Rating distribution and language analysis
- Job Management: Start, monitor, and control scraping jobs
- Data Visualization: Comprehensive statistics and analytics
🔧 API-First Design
- RESTful API: Complete REST API for all operations
- WebSocket Support: Real-time updates and notifications (a client sketch follows this list)
- Comprehensive Endpoints: Full CRUD operations for scraping and rating
- Health Monitoring: System health checks and status monitoring
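The WebSocket endpoint path is not specified in this document, so the following is purely an illustrative client sketch: the ws://localhost:8000/ws address is a hypothetical assumption, and it uses the third-party websockets library.

import asyncio
import websockets  # third-party client library: pip install websockets

async def listen_for_updates():
    # "/ws" is a hypothetical endpoint path, not confirmed by this documentation
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        async for message in ws:  # each message is one real-time update
            print("update:", message)

asyncio.run(listen_for_updates())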
Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │     FastAPI     │    │    Database     │
│    Dashboard    │◄──►│     Backend     │◄──►│     SQLite      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │    Services     │
                       │                 │
                       │  • Scraping     │
                       │  • Rating       │
                       │  • OCR          │
                       └─────────────────┘
Installation & Setup
Prerequisites
- Python 3.8+
- FastAPI
- SQLite3
- Required Python packages (see requirements.txt)
Quick Start
- Clone the repository:
git clone <repository-url>
cd legal_dashboard_ocr
- Install dependencies:
pip install -r requirements.txt
- Start the application:
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
- Access the dashboard:
http://localhost:8000/scraping_dashboard.html
Docker Deployment
# Build the Docker image
docker build -t legal-dashboard-scraping .
# Run the container
docker run -p 8000:8000 legal-dashboard-scraping
API Reference
Scraping Endpoints
POST /api/scrape
Start a new scraping job.
Request Body:
{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "strategy": "legal_documents",
  "keywords": ["contract", "agreement"],
  "content_types": ["html", "pdf"],
  "max_depth": 1,
  "delay_between_requests": 1.0
}
Response:
{
  "job_id": "scrape_job_20240101_120000_abc123",
  "status": "started",
  "message": "Scraping job started successfully with 2 URLs"
}
GET /api/scrape/status
Get status of all scraping jobs.
Response:
[
  {
    "job_id": "scrape_job_20240101_120000_abc123",
    "status": "processing",
    "total_items": 2,
    "completed_items": 1,
    "failed_items": 0,
    "progress": 0.5,
    "created_at": "2024-01-01T12:00:00Z",
    "strategy": "legal_documents"
  }
]
GET /api/scrape/items
Get scraped items with optional filtering.
Query Parameters:
- job_id (optional): Filter by job ID
- limit (default: 100): Maximum items to return
- offset (default: 0): Number of items to skip
Response:
[
  {
    "id": "item_20240101_120000_def456",
    "url": "https://example.com/page1",
    "title": "Legal Document Title",
    "content": "Extracted content...",
    "metadata": {...},
    "timestamp": "2024-01-01T12:00:00Z",
    "rating_score": 0.85,
    "processing_status": "completed",
    "word_count": 1500,
    "language": "english",
    "domain": "example.com"
  }
]
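For example, a client can page through a job's results using the query parameters above (a minimal sketch against a local development server):

import requests

# Fetch the first 20 items for one job; increase offset to page further
params = {"job_id": "scrape_job_20240101_120000_abc123", "limit": 20, "offset": 0}
items = requests.get("http://localhost:8000/api/scrape/items", params=params).json()
for item in items:
    print(item["url"], item["rating_score"])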
Rating Endpoints
POST /api/rating/rate-all
Rate all unrated scraped items.
Response:
{
  "total_items": 50,
  "rated_count": 45,
  "failed_count": 5,
  "message": "Rated 45 items, 5 failed"
}
GET /api/rating/summary
Get comprehensive rating summary.
Response:
{
  "total_rated": 100,
  "average_score": 0.75,
  "score_range": {
    "min": 0.2,
    "max": 0.95
  },
  "average_confidence": 0.82,
  "rating_level_distribution": {
    "excellent": 25,
    "good": 40,
    "average": 25,
    "poor": 10
  },
  "criteria_averages": {
    "source_credibility": 0.8,
    "content_completeness": 0.7,
    "ocr_accuracy": 0.85
  },
  "recent_ratings_24h": 15
}
GET /api/rating/low-quality
Get items with low quality ratings.
Query Parameters:
- threshold (default: 0.4): Quality threshold
- limit (default: 50): Maximum items to return
Response:
{
  "threshold": 0.4,
  "total_items": 10,
  "items": [...]
}
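The same endpoint is easy to call from Python (a minimal sketch using the parameters above):

import requests

response = requests.get(
    "http://localhost:8000/api/rating/low-quality",
    params={"threshold": 0.4, "limit": 50},
)
for item in response.json()["items"]:
    print(item)  # item fields are assumed to match the /api/scrape/items schema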
Scraping Strategies
1. General Strategy
- Extracts all text content from web pages
- Suitable for general web scraping tasks
- Minimal content filtering
2. Legal Documents Strategy
- Focuses on legal document content
- Extracts structured legal text
- Identifies legal patterns and terminology
- Optimized for Persian and English legal content
3. News Articles Strategy
- Extracts news article content
- Removes navigation and advertising
- Focuses on article body and headlines
4. Academic Papers Strategy
- Extracts academic content
- Preserves citations and references
- Maintains document structure
5. Government Sites Strategy
- Optimized for government websites
- Extracts official documents and announcements
- Handles government-specific content structures
6. Custom Strategy
- User-defined content extraction rules
- Configurable selectors and patterns
- Flexible content processing
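The configuration interface for custom strategies is not detailed here; purely as an illustration of selector-based extraction, the selectors mapping and helper below are hypothetical, not part of the documented API:

from bs4 import BeautifulSoup

# Hypothetical selector configuration for a custom strategy
selectors = {
    "title": "h1.document-title",   # assumed CSS selector
    "body": "div.document-body p",  # assumed CSS selector
}

def extract_with_selectors(html: str, rules: dict) -> dict:
    """Illustrative helper, not part of the documented API."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        field: "\n".join(node.get_text(strip=True) for node in soup.select(selector))
        for field, selector in rules.items()
    }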
Rating Criteria
Source Credibility (25%)
- Domain authority and reputation
- Government/educational institution status
- HTTPS security
- Official indicators in metadata
Content Completeness (25%)
- Word count and content length
- Structured content (chapters, sections)
- Legal document patterns
- Quality indicators
OCR Accuracy (20%)
- Text quality and readability
- Character recognition accuracy
- Sentence structure quality
- Formatting consistency
Data Freshness (15%)
- Content age and timeliness
- Update frequency
- Historical relevance
Content Relevance (10%)
- Legal terminology density
- Domain-specific language
- Official language indicators
Technical Quality (5%)
- Document structure
- Formatting consistency
- Metadata quality
- Content organization
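The overall score is the weighted sum of the six criterion scores, using the percentages above. A minimal sketch of that calculation (the per-criterion values here are made-up examples):

# Weights from the Rating Criteria section (they sum to 1.0)
WEIGHTS = {
    "source_credibility": 0.25,
    "content_completeness": 0.25,
    "ocr_accuracy": 0.20,
    "data_freshness": 0.15,
    "content_relevance": 0.10,
    "technical_quality": 0.05,
}

# Example criterion scores for one item (illustrative values)
scores = {
    "source_credibility": 0.8,
    "content_completeness": 0.7,
    "ocr_accuracy": 0.85,
    "data_freshness": 0.6,
    "content_relevance": 0.9,
    "technical_quality": 0.75,
}

overall = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
print(f"{overall:.4f}")  # 0.7625 -> "good" band under the default thresholds (0.6-0.8)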
Database Schema
scraped_items Table
CREATE TABLE scraped_items (
    id TEXT PRIMARY KEY,
    url TEXT NOT NULL,
    title TEXT,
    content TEXT,
    metadata TEXT,
    timestamp TEXT,
    source_url TEXT,
    rating_score REAL DEFAULT 0.0,
    processing_status TEXT DEFAULT 'pending',
    error_message TEXT,
    strategy_used TEXT,
    content_hash TEXT,
    word_count INTEGER DEFAULT 0,
    language TEXT DEFAULT 'unknown',
    domain TEXT
);
rating_results Table
CREATE TABLE rating_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    item_id TEXT NOT NULL,
    overall_score REAL,
    criteria_scores TEXT,
    rating_level TEXT,
    confidence REAL,
    timestamp TEXT,
    evaluator TEXT,
    notes TEXT,
    FOREIGN KEY (item_id) REFERENCES scraped_items (id)
);
scraping_jobs Table
CREATE TABLE scraping_jobs (
    job_id TEXT PRIMARY KEY,
    urls TEXT,
    strategy TEXT,
    keywords TEXT,
    content_types TEXT,
    max_depth INTEGER DEFAULT 1,
    delay_between_requests REAL DEFAULT 1.0,
    timeout INTEGER DEFAULT 30,
    created_at TEXT,
    status TEXT DEFAULT 'pending',
    total_items INTEGER DEFAULT 0,
    completed_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0
);
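As an illustration of how the tables relate, here is a minimal sqlite3 sketch that joins rating results back to their scraped items (the legal_documents.db path follows the Scraping Configuration example below; adjust it for your deployment):

import sqlite3

conn = sqlite3.connect("legal_documents.db")

# Items whose ratings fell below a quality threshold, worst first
rows = conn.execute(
    """
    SELECT s.id, s.url, r.overall_score, r.rating_level
    FROM scraped_items AS s
    JOIN rating_results AS r ON r.item_id = s.id
    WHERE r.overall_score < ?
    ORDER BY r.overall_score ASC
    """,
    (0.4,),
).fetchall()

for item_id, url, score, level in rows:
    print(f"{item_id}: {url} scored {score} ({level})")

conn.close()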
Configuration
Rating Configuration
from app.services.rating_service import RatingConfig

config = RatingConfig(
    source_credibility_weight=0.25,
    content_completeness_weight=0.25,
    ocr_accuracy_weight=0.20,
    data_freshness_weight=0.15,
    content_relevance_weight=0.10,
    technical_quality_weight=0.05,
    excellent_threshold=0.8,
    good_threshold=0.6,
    average_threshold=0.4,
    poor_threshold=0.2
)
The six criterion weights mirror the percentages listed under Rating Criteria and sum to 1.0; the four thresholds define the excellent/good/average/poor bands reported in rating summaries.
Scraping Configuration
from app.services.scraping_service import ScrapingService

scraping_service = ScrapingService(
    db_path="legal_documents.db",
    max_workers=10,
    timeout=30,
    user_agent="Legal-Dashboard-Scraper/1.0"
)
Usage Examples
Starting a Scraping Job
import asyncio

from app.services.scraping_service import ScrapingService, ScrapingStrategy

async def scrape_legal_documents():
    service = ScrapingService()
    urls = [
        "https://court.gov.ir/document1",
        "https://justice.gov.ir/document2"
    ]
    job_id = await service.start_scraping_job(
        urls=urls,
        strategy=ScrapingStrategy.LEGAL_DOCUMENTS,
        keywords=["قرارداد", "contract", "agreement"],
        max_depth=1,
        delay=2.0
    )
    print(f"Started scraping job: {job_id}")

# Run the scraping job
asyncio.run(scrape_legal_documents())
Rating Scraped Items
import asyncio

from app.services.rating_service import RatingService
from app.services.scraping_service import ScrapingService

async def rate_items():
    rating_service = RatingService()
    scraping_service = ScrapingService()

    # Get scraped items
    items = await scraping_service.get_scraped_items()

    # Rate each unrated item
    for item in items:
        if item['rating_score'] == 0.0:  # Unrated items
            result = await rating_service.rate_item(item)
            print(f"Rated item {item['id']}: {result.rating_level.value} ({result.overall_score})")

# Run the rating process
asyncio.run(rate_items())
API Integration
import time

import requests

# Start a scraping job
response = requests.post("http://localhost:8000/api/scrape", json={
    "urls": ["https://example.com/legal-doc"],
    "strategy": "legal_documents",
    "max_depth": 1
})
job_id = response.json()["job_id"]

# Monitor job progress (GET /api/scrape/status returns all jobs)
while True:
    jobs = requests.get("http://localhost:8000/api/scrape/status").json()
    job = next((j for j in jobs if j["job_id"] == job_id), None)
    if job and job["status"] == "completed":
        break
    time.sleep(5)

# Get rated items
items_response = requests.get("http://localhost:8000/api/scrape/items")
items = items_response.json()

# Get rating summary
summary_response = requests.get("http://localhost:8000/api/rating/summary")
summary = summary_response.json()
Testing
Running Tests
# Run all tests
pytest tests/test_scraping_system.py -v
# Run specific test categories
pytest tests/test_scraping_system.py::TestScrapingService -v
pytest tests/test_scraping_system.py::TestRatingService -v
pytest tests/test_scraping_system.py::TestScrapingAPI -v
# Run with coverage
pytest tests/test_scraping_system.py --cov=app.services --cov-report=html
Test Categories
- Unit Tests: Individual component testing
- Integration Tests: End-to-end workflow testing
- API Tests: REST API endpoint testing
- Performance Tests: Load and stress testing
- Error Handling Tests: Exception and error scenario testing
Monitoring & Logging
Log Levels
- INFO: General operational information
- WARNING: Non-critical issues and warnings
- ERROR: Error conditions and failures
- DEBUG: Detailed debugging information
Key Metrics
- Scraping Jobs: Active jobs, completion rates, failure rates
- Data Quality: Average ratings, rating distributions, quality trends
- System Performance: Response times, throughput, resource usage
- Error Rates: Failed requests, parsing errors, rating failures
Health Checks
# Check system health
curl http://localhost:8000/api/health
# Check scraping service health
curl http://localhost:8000/api/scrape/statistics
# Check rating service health
curl http://localhost:8000/api/rating/summary
Troubleshooting
Common Issues
1. Scraping Jobs Not Starting
Symptoms: Jobs remain in "pending" status
Solutions:
- Check network connectivity
- Verify URL accessibility
- Review rate limiting settings
- Check server logs for errors
2. Low Rating Scores
Symptoms: Items consistently getting low ratings
Solutions:
- Review content quality and completeness
- Check source credibility settings
- Adjust rating criteria weights
- Verify OCR accuracy for text extraction
3. Database Errors
Symptoms: Database connection failures or data corruption
Solutions:
- Check database file permissions
- Verify SQLite installation
- Review database schema
- Check for disk space issues
4. Performance Issues
Symptoms: Slow response times or high resource usage
Solutions:
- Reduce concurrent scraping jobs
- Increase delay between requests
- Optimize database queries
- Review memory usage patterns
Debug Mode
Enable debug logging for detailed troubleshooting:
import logging
logging.basicConfig(level=logging.DEBUG)
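To keep third-party libraries quiet while debugging the application itself, the level can also be scoped to the application's loggers (a sketch; the "app" logger name is an assumption based on the app.* module layout used in the examples above):

import logging

logging.basicConfig(level=logging.INFO)
# Raise verbosity only for the application's own modules (assumed logger name)
logging.getLogger("app").setLevel(logging.DEBUG)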
Error Recovery
The system includes automatic error recovery mechanisms:
- Job Retry: Failed scraping jobs can be retried (a generic retry sketch follows this list)
- Data Validation: Automatic validation of scraped content
- Graceful Degradation: System continues operating with partial failures
- Error Logging: Comprehensive error logging for analysis
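The retry mechanism itself is internal to the system; the following is only a generic sketch of the pattern it describes, with exponential backoff between attempts (all names here are illustrative):

import time

def with_retries(task, max_attempts=3, base_delay=1.0):
    """Illustrative retry helper, not the system's internal API."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            wait = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait:.1f}s")
            time.sleep(wait)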
Security Considerations
Data Protection
- Encryption: Sensitive data encrypted at rest
- Access Control: API authentication and authorization
- Input Validation: Comprehensive input sanitization
- Rate Limiting: Protection against abuse
Privacy Compliance
- Data Retention: Configurable data retention policies
- User Consent: Respect for website terms of service
- Data Minimization: Only necessary data is collected
- Right to Deletion: User data can be deleted on request
Network Security
- HTTPS: All communications encrypted
- Certificate Validation: Proper SSL certificate validation
- Firewall Rules: Network access controls
- DDoS Protection: Rate limiting and traffic filtering
Performance Optimization
Scraping Performance
- Async Processing: Non-blocking I/O operations (see the combined sketch after this list)
- Connection Pooling: Reuse HTTP connections
- Caching: Cache frequently accessed content
- Parallel Processing: Multiple concurrent scraping jobs
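A minimal sketch of how these techniques combine, using aiohttp (the URLs, concurrency limit, and delay are illustrative; the actual scraping service may be structured differently):

import asyncio
import aiohttp

async def fetch_all(urls, max_concurrent=5, delay=1.0):
    # Connection pooling: one session reuses TCP connections across requests
    connector = aiohttp.TCPConnector(limit=max_concurrent)
    semaphore = asyncio.Semaphore(max_concurrent)

    async with aiohttp.ClientSession(connector=connector) as session:
        async def fetch(url):
            async with semaphore:  # cap the number of in-flight requests
                async with session.get(url) as resp:
                    body = await resp.text()
                await asyncio.sleep(delay)  # polite delay between requests
                return url, len(body)

        return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
print(results)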
Database Performance
- Indexing: Optimized database indexes (an indexing sketch follows this list)
- Query Optimization: Efficient SQL queries
- Connection Pooling: Database connection management
- Data Archiving: Automatic archiving of old data
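For instance, indexes on the columns the API filters and joins by speed up the common queries (a sketch; the indexes actually shipped with the system may differ):

import sqlite3

conn = sqlite3.connect("legal_documents.db")

# Index the columns used for filtering and joining (names taken from the schema above)
conn.execute("CREATE INDEX IF NOT EXISTS idx_items_domain ON scraped_items (domain)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_items_score ON scraped_items (rating_score)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_ratings_item ON rating_results (item_id)")

conn.commit()
conn.close()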
Memory Management
- Streaming: Process large datasets in chunks (a chunked-read sketch follows this list)
- Garbage Collection: Proper memory cleanup
- Resource Limits: Configurable memory limits
- Monitoring: Real-time memory usage tracking
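For example, large result sets can be read in fixed-size chunks rather than loaded at once (a sketch using sqlite3's fetchmany; the chunk size is illustrative):

import sqlite3

conn = sqlite3.connect("legal_documents.db")
cursor = conn.execute("SELECT id, content FROM scraped_items")

# Process rows in bounded chunks to keep memory usage flat
while True:
    chunk = cursor.fetchmany(500)
    if not chunk:
        break
    for item_id, content in chunk:
        ...  # process each item here

conn.close()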
Future Enhancements
Planned Features
- Machine Learning: Advanced content classification
- Natural Language Processing: Enhanced text analysis
- Multi-language Support: Additional language support
- Cloud Integration: Cloud storage and processing
- Advanced Analytics: Detailed analytics and reporting
Scalability Improvements
- Microservices Architecture: Service decomposition
- Load Balancing: Distributed processing
- Caching Layer: Redis integration
- Message Queues: Asynchronous processing
Support & Contributing
Getting Help
- Documentation: Comprehensive documentation and examples
- Community: Active community support
- Issues: GitHub issue tracking
- Discussions: Community discussions and Q&A
Contributing
- Code Standards: Follow PEP 8 and project guidelines
- Testing: Include comprehensive tests
- Documentation: Update documentation for changes
- Review Process: Code review and approval process
License
This project is licensed under the MIT License. See LICENSE file for details.
Note: This documentation is continuously updated. For the latest version, please check the project repository.