Legal Dashboard - Scraping & Rating System Documentation

Overview

The Legal Dashboard Scraping & Rating System is a comprehensive web scraping and data quality evaluation platform designed specifically for legal document processing. The system provides advanced scraping capabilities with multiple strategies, intelligent data rating, and a modern web dashboard for monitoring and control.

Features

🕷️ Advanced Web Scraping

  • Multiple Scraping Strategies: General, Legal Documents, News Articles, Academic Papers, Government Sites, Custom
  • Async Processing: High-performance asynchronous scraping with configurable delays
  • Content Extraction: Intelligent content extraction based on strategy and page structure
  • Error Handling: Comprehensive error handling and logging
  • Rate Limiting: Built-in rate limiting to respect website policies

⭐ Intelligent Data Rating

  • Multi-Criteria Evaluation: Source credibility, content completeness, OCR accuracy, data freshness, content relevance, technical quality
  • Dynamic Scoring: Real-time rating updates as data is processed
  • Quality Indicators: Automatic detection of legal document patterns and quality markers
  • Confidence Scoring: Statistical confidence levels for rating accuracy

📊 Real-Time Dashboard

  • Live Monitoring: Real-time job progress and system statistics
  • Interactive Charts: Rating distribution and language analysis
  • Job Management: Start, monitor, and control scraping jobs
  • Data Visualization: Comprehensive statistics and analytics

🔧 API-First Design

  • RESTful API: Complete REST API for all operations
  • WebSocket Support: Real-time updates and notifications (see the client sketch after this list)
  • Comprehensive Endpoints: Full CRUD operations for scraping and rating
  • Health Monitoring: System health checks and status monitoring
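
The WebSocket channel pushes job progress and notifications to clients without polling. Below is a minimal listener sketch using the third-party websockets package; note the /ws path is an assumption for illustration, not a documented route, so check the backend's API routes for the actual endpoint.

import asyncio
import websockets

async def listen():
    # NOTE: the endpoint path below is an assumption; consult the backend's
    # API routes for the WebSocket URL it actually exposes.
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        while True:
            message = await ws.recv()  # e.g. job progress updates
            print(message)

asyncio.run(listen())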

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   FastAPI       │    │   Database      │
│   Dashboard     │◄──►│   Backend       │◄──►│   SQLite        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │
                              ▼
                       ┌─────────────────┐
                       │   Services      │
                       │                 │
                       │ • Scraping      │
                       │ • Rating        │
                       │ • OCR           │
                       └─────────────────┘

Installation & Setup

Prerequisites

  • Python 3.8+
  • FastAPI
  • SQLite3
  • Required Python packages (see requirements.txt)

Quick Start

  1. Clone the repository:
git clone <repository-url>
cd legal_dashboard_ocr
  2. Install dependencies:
pip install -r requirements.txt
  3. Start the application:
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
  4. Access the dashboard:
http://localhost:8000/scraping_dashboard.html

Docker Deployment

# Build the Docker image
docker build -t legal-dashboard-scraping .

# Run the container
docker run -p 8000:8000 legal-dashboard-scraping

API Reference

Scraping Endpoints

POST /api/scrape

Start a new scraping job.

Request Body:

{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "strategy": "legal_documents",
  "keywords": ["contract", "agreement"],
  "content_types": ["html", "pdf"],
  "max_depth": 1,
  "delay_between_requests": 1.0
}

Response:

{
  "job_id": "scrape_job_20240101_120000_abc123",
  "status": "started",
  "message": "Scraping job started successfully with 2 URLs"
}

GET /api/scrape/status

Get status of all scraping jobs.

Response:

[
  {
    "job_id": "scrape_job_20240101_120000_abc123",
    "status": "processing",
    "total_items": 2,
    "completed_items": 1,
    "failed_items": 0,
    "progress": 0.5,
    "created_at": "2024-01-01T12:00:00Z",
    "strategy": "legal_documents"
  }
]

GET /api/scrape/items

Get scraped items with optional filtering.

Query Parameters:

  • job_id (optional): Filter by job ID
  • limit (default: 100): Maximum items to return
  • offset (default: 0): Number of items to skip

Response:

[
  {
    "id": "item_20240101_120000_def456",
    "url": "https://example.com/page1",
    "title": "Legal Document Title",
    "content": "Extracted content...",
    "metadata": {...},
    "timestamp": "2024-01-01T12:00:00Z",
    "rating_score": 0.85,
    "processing_status": "completed",
    "word_count": 1500,
    "language": "english",
    "domain": "example.com"
  }
]
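
For example, the query parameters combine to page through a single job's results (the job ID value below is illustrative):

import requests

# Fetch the second page of 50 items for one job via limit/offset paging
resp = requests.get(
    "http://localhost:8000/api/scrape/items",
    params={
        "job_id": "scrape_job_20240101_120000_abc123",
        "limit": 50,
        "offset": 50,
    },
)
print(f"Fetched {len(resp.json())} items")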

Rating Endpoints

POST /api/rating/rate-all

Rate all unrated scraped items.

Response:

{
  "total_items": 50,
  "rated_count": 45,
  "failed_count": 5,
  "message": "Rated 45 items, 5 failed"
}

GET /api/rating/summary

Get comprehensive rating summary.

Response:

{
  "total_rated": 100,
  "average_score": 0.75,
  "score_range": {
    "min": 0.2,
    "max": 0.95
  },
  "average_confidence": 0.82,
  "rating_level_distribution": {
    "excellent": 25,
    "good": 40,
    "average": 25,
    "poor": 10
  },
  "criteria_averages": {
    "source_credibility": 0.8,
    "content_completeness": 0.7,
    "ocr_accuracy": 0.85
  },
  "recent_ratings_24h": 15
}

GET /api/rating/low-quality

Get items with low quality ratings.

Query Parameters:

  • threshold (default: 0.4): Quality threshold
  • limit (default: 50): Maximum items to return

Response:

{
  "threshold": 0.4,
  "total_items": 10,
  "items": [...]
}
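
A short sketch that pulls the lowest-quality items for manual review, using a stricter threshold than the default and assuming the returned items carry the same fields as the /api/scrape/items response:

import requests

# Request up to 20 items scoring below 0.3
resp = requests.get(
    "http://localhost:8000/api/rating/low-quality",
    params={"threshold": 0.3, "limit": 20},
)
for item in resp.json()["items"]:
    print(item["url"], item["rating_score"])  # field names assumed, see above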

Scraping Strategies

1. General Strategy

  • Extracts all text content from web pages
  • Suitable for general web scraping tasks
  • Minimal content filtering

2. Legal Documents Strategy

  • Focuses on legal document content
  • Extracts structured legal text
  • Identifies legal patterns and terminology
  • Optimized for Persian and English legal content

3. News Articles Strategy

  • Extracts news article content
  • Removes navigation and advertising
  • Focuses on article body and headlines

4. Academic Papers Strategy

  • Extracts academic content
  • Preserves citations and references
  • Maintains document structure

5. Government Sites Strategy

  • Optimized for government websites
  • Extracts official documents and announcements
  • Handles government-specific content structures

6. Custom Strategy

  • User-defined content extraction rules
  • Configurable selectors and patterns
  • Flexible content processing
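
To illustrate what custom rules might look like, the sketch below maps CSS selectors to output fields and applies them with BeautifulSoup. The rule schema and field names here are assumptions for illustration only; the actual configuration format is defined by the scraping service.

from bs4 import BeautifulSoup

# Hypothetical selector-based rules: output field -> CSS selector
CUSTOM_RULES = {
    "title": "h1.document-title",
    "body": "div.legal-text",
    "date": "span.publish-date",
}

def extract_with_rules(html: str, rules: dict) -> dict:
    """Apply each selector and collect the first matching node's text."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in rules.items():
        node = soup.select_one(selector)
        result[field] = node.get_text(strip=True) if node else None
    return result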

Rating Criteria

Source Credibility (25%)

  • Domain authority and reputation
  • Government/educational institution status
  • HTTPS security
  • Official indicators in metadata

Content Completeness (25%)

  • Word count and content length
  • Structured content (chapters, sections)
  • Legal document patterns
  • Quality indicators

OCR Accuracy (20%)

  • Text quality and readability
  • Character recognition accuracy
  • Sentence structure quality
  • Formatting consistency

Data Freshness (15%)

  • Content age and timeliness
  • Update frequency
  • Historical relevance

Content Relevance (10%)

  • Legal terminology density
  • Domain-specific language
  • Official language indicators

Technical Quality (5%)

  • Document structure
  • Formatting consistency
  • Metadata quality
  • Content organization
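
Putting the weights together: assuming the overall score is a weighted sum of the six criterion scores (each in [0, 1]), a minimal sketch of the computation looks like this:

# Criterion weights from the sections above (they sum to 1.0)
WEIGHTS = {
    "source_credibility": 0.25,
    "content_completeness": 0.25,
    "ocr_accuracy": 0.20,
    "data_freshness": 0.15,
    "content_relevance": 0.10,
    "technical_quality": 0.05,
}

# Illustrative per-criterion scores for one item
scores = {
    "source_credibility": 0.8,
    "content_completeness": 0.7,
    "ocr_accuracy": 0.85,
    "data_freshness": 0.6,
    "content_relevance": 0.9,
    "technical_quality": 0.75,
}

overall = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
print(round(overall, 4))  # 0.7625, i.e. "good" under the default thresholds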

Database Schema

scraped_items Table

CREATE TABLE scraped_items (
    id TEXT PRIMARY KEY,
    url TEXT NOT NULL,
    title TEXT,
    content TEXT,
    metadata TEXT,
    timestamp TEXT,
    source_url TEXT,
    rating_score REAL DEFAULT 0.0,
    processing_status TEXT DEFAULT 'pending',
    error_message TEXT,
    strategy_used TEXT,
    content_hash TEXT,
    word_count INTEGER DEFAULT 0,
    language TEXT DEFAULT 'unknown',
    domain TEXT
);

rating_results Table

CREATE TABLE rating_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    item_id TEXT NOT NULL,
    overall_score REAL,
    criteria_scores TEXT,
    rating_level TEXT,
    confidence REAL,
    timestamp TEXT,
    evaluator TEXT,
    notes TEXT,
    FOREIGN KEY (item_id) REFERENCES scraped_items (id)
);

scraping_jobs Table

CREATE TABLE scraping_jobs (
    job_id TEXT PRIMARY KEY,
    urls TEXT,
    strategy TEXT,
    keywords TEXT,
    content_types TEXT,
    max_depth INTEGER DEFAULT 1,
    delay_between_requests REAL DEFAULT 1.0,
    timeout INTEGER DEFAULT 30,
    created_at TEXT,
    status TEXT DEFAULT 'pending',
    total_items INTEGER DEFAULT 0,
    completed_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0
);
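
With this schema, rated documents can be inspected directly. A minimal sqlite3 sketch, assuming the database file name from the Scraping Configuration example below (legal_documents.db):

import sqlite3

# List the lowest-scoring rated documents via the item_id foreign key
conn = sqlite3.connect("legal_documents.db")
rows = conn.execute(
    """
    SELECT s.id, s.url, r.overall_score, r.rating_level
    FROM scraped_items AS s
    JOIN rating_results AS r ON r.item_id = s.id
    WHERE r.overall_score < 0.4
    ORDER BY r.overall_score ASC
    LIMIT 20
    """
).fetchall()
for item_id, url, score, level in rows:
    print(f"{score:.2f}  {level}  {url}")
conn.close()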

Configuration

Rating Configuration

from app.services.rating_service import RatingConfig

config = RatingConfig(
    source_credibility_weight=0.25,
    content_completeness_weight=0.25,
    ocr_accuracy_weight=0.20,
    data_freshness_weight=0.15,
    content_relevance_weight=0.10,
    technical_quality_weight=0.05,
    excellent_threshold=0.8,
    good_threshold=0.6,
    average_threshold=0.4,
    poor_threshold=0.2
)
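
Note that the six criterion weights sum to 1.0, and the four thresholds map an item's overall score to the excellent/good/average/poor levels reported by GET /api/rating/summary.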

Scraping Configuration

from app.services.scraping_service import ScrapingService

scraping_service = ScrapingService(
    db_path="legal_documents.db",
    max_workers=10,
    timeout=30,
    user_agent="Legal-Dashboard-Scraper/1.0"
)

Usage Examples

Starting a Scraping Job

import asyncio
from app.services.scraping_service import ScrapingService, ScrapingStrategy

async def scrape_legal_documents():
    service = ScrapingService()
    
    urls = [
        "https://court.gov.ir/document1",
        "https://justice.gov.ir/document2"
    ]
    
    job_id = await service.start_scraping_job(
        urls=urls,
        strategy=ScrapingStrategy.LEGAL_DOCUMENTS,
        keywords=["قرارداد", "contract", "agreement"],
        max_depth=1,
        delay=2.0
    )
    
    print(f"Started scraping job: {job_id}")

# Run the scraping job
asyncio.run(scrape_legal_documents())

Rating Scraped Items

import asyncio
from app.services.rating_service import RatingService
from app.services.scraping_service import ScrapingService

async def rate_items():
    rating_service = RatingService()
    scraping_service = ScrapingService()
    
    # Get scraped items
    items = await scraping_service.get_scraped_items()
    
    # Rate each item that has not been scored yet
    for item in items:
        if item['rating_score'] == 0.0:  # Unrated items
            result = await rating_service.rate_item(item)
            print(f"Rated item {item['id']}: {result.rating_level.value} ({result.overall_score})")

# Run the rating process
asyncio.run(rate_items())

API Integration

import time
import requests

# Start a scraping job
response = requests.post("http://localhost:8000/api/scrape", json={
    "urls": ["https://example.com/legal-doc"],
    "strategy": "legal_documents",
    "max_depth": 1
})

job_id = response.json()["job_id"]

# Monitor job progress via the documented status endpoint
while True:
    jobs = requests.get("http://localhost:8000/api/scrape/status").json()
    job = next((j for j in jobs if j["job_id"] == job_id), None)
    
    if job and job["status"] == "completed":
        break
    
    time.sleep(5)

# Get rated items
items_response = requests.get("http://localhost:8000/api/scrape/items")
items = items_response.json()

# Get rating summary
summary_response = requests.get("http://localhost:8000/api/rating/summary")
summary = summary_response.json()

Testing

Running Tests

# Run all tests
pytest tests/test_scraping_system.py -v

# Run specific test categories
pytest tests/test_scraping_system.py::TestScrapingService -v
pytest tests/test_scraping_system.py::TestRatingService -v
pytest tests/test_scraping_system.py::TestScrapingAPI -v

# Run with coverage
pytest tests/test_scraping_system.py --cov=app.services --cov-report=html

Test Categories

  • Unit Tests: Individual component testing
  • Integration Tests: End-to-end workflow testing
  • API Tests: REST API endpoint testing
  • Performance Tests: Load and stress testing
  • Error Handling Tests: Exception and error scenario testing

Monitoring & Logging

Log Levels

  • INFO: General operational information
  • WARNING: Non-critical issues and warnings
  • ERROR: Error conditions and failures
  • DEBUG: Detailed debugging information

Key Metrics

  • Scraping Jobs: Active jobs, completion rates, failure rates
  • Data Quality: Average ratings, rating distributions, quality trends
  • System Performance: Response times, throughput, resource usage
  • Error Rates: Failed requests, parsing errors, rating failures

Health Checks

# Check system health
curl http://localhost:8000/api/health

# Check scraping service health
curl http://localhost:8000/api/scrape/statistics

# Check rating service health
curl http://localhost:8000/api/rating/summary

Troubleshooting

Common Issues

1. Scraping Jobs Not Starting

Symptoms: Jobs remain in "pending" status.

Solutions:

  • Check network connectivity
  • Verify URL accessibility
  • Review rate limiting settings
  • Check server logs for errors

2. Low Rating Scores

Symptoms: Items consistently getting low ratings.

Solutions:

  • Review content quality and completeness
  • Check source credibility settings
  • Adjust rating criteria weights
  • Verify OCR accuracy for text extraction

3. Database Errors

Symptoms: Database connection failures or data corruption.

Solutions:

  • Check database file permissions
  • Verify SQLite installation
  • Review database schema
  • Check for disk space issues

4. Performance Issues

Symptoms: Slow response times or high resource usage.

Solutions:

  • Reduce concurrent scraping jobs
  • Increase delay between requests
  • Optimize database queries
  • Review memory usage patterns

Debug Mode

Enable debug logging for detailed troubleshooting:

import logging
logging.basicConfig(level=logging.DEBUG)
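
To raise verbosity for one component instead of the whole process, target its module logger (the module path matches the import used in the configuration examples above):

import logging

# Debug output from the scraping service only
logging.getLogger("app.services.scraping_service").setLevel(logging.DEBUG)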

Error Recovery

The system includes automatic error recovery mechanisms:

  • Job Retry: Failed scraping jobs can be retried (see the backoff sketch after this list)
  • Data Validation: Automatic validation of scraped content
  • Graceful Degradation: System continues operating with partial failures
  • Error Logging: Comprehensive error logging for analysis
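
As a sketch of the job-retry mechanism, the wrapper below re-submits a failed job with exponential backoff. The start_scraping_job call matches the usage example earlier; the broad exception handling and retry parameters are illustrative only.

import asyncio

async def retry_job(service, urls, strategy, attempts=3):
    """Re-submit a scraping job with exponential backoff between attempts."""
    delay = 2.0
    for attempt in range(1, attempts + 1):
        try:
            return await service.start_scraping_job(urls=urls, strategy=strategy)
        except Exception as exc:  # real code should catch narrower errors
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff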

Security Considerations

Data Protection

  • Encryption: Sensitive data encrypted at rest
  • Access Control: API authentication and authorization
  • Input Validation: Comprehensive input sanitization
  • Rate Limiting: Protection against abuse

Privacy Compliance

  • Data Retention: Configurable data retention policies
  • User Consent: Respect for website terms of service
  • Data Minimization: Only necessary data is collected
  • Right to Deletion: User data can be deleted on request

Network Security

  • HTTPS: All communications encrypted
  • Certificate Validation: Proper SSL certificate validation
  • Firewall Rules: Network access controls
  • DDoS Protection: Rate limiting and traffic filtering

Performance Optimization

Scraping Performance

  • Async Processing: Non-blocking I/O operations
  • Connection Pooling: Reuse HTTP connections
  • Caching: Cache frequently accessed content
  • Parallel Processing: Multiple concurrent scraping jobs

Database Performance

  • Indexing: Optimized database indexes
  • Query Optimization: Efficient SQL queries
  • Connection Pooling: Database connection management
  • Data Archiving: Automatic archiving of old data

Memory Management

  • Streaming: Process large datasets in chunks (see the paging sketch after this list)
  • Garbage Collection: Proper memory cleanup
  • Resource Limits: Configurable memory limits
  • Monitoring: Real-time memory usage tracking
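
The streaming approach applies at the API level as well: the sketch below pages through /api/scrape/items with limit/offset so large result sets are processed one batch at a time rather than loaded all at once. It assumes the endpoint returns an empty list once the offset passes the last item.

import requests

def iter_items(base_url="http://localhost:8000", page_size=100):
    """Yield scraped items one at a time, fetching them in pages."""
    offset = 0
    while True:
        batch = requests.get(
            f"{base_url}/api/scrape/items",
            params={"limit": page_size, "offset": offset},
        ).json()
        if not batch:
            return
        yield from batch
        offset += page_size

for item in iter_items():
    print(item["id"])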

Future Enhancements

Planned Features

  • Machine Learning: Advanced content classification
  • Natural Language Processing: Enhanced text analysis
  • Multi-language Support: Additional language support
  • Cloud Integration: Cloud storage and processing
  • Advanced Analytics: Detailed analytics and reporting

Scalability Improvements

  • Microservices Architecture: Service decomposition
  • Load Balancing: Distributed processing
  • Caching Layer: Redis integration
  • Message Queues: Asynchronous processing

Support & Contributing

Getting Help

  • Documentation: Comprehensive documentation and examples
  • Community: Active community support
  • Issues: GitHub issue tracking
  • Discussions: Community discussions and Q&A

Contributing

  • Code Standards: Follow PEP 8 and project guidelines
  • Testing: Include comprehensive tests
  • Documentation: Update documentation for changes
  • Review Process: Code review and approval process

License

This project is licensed under the MIT License. See LICENSE file for details.


Note: This documentation is continuously updated. For the latest version, please check the project repository.