Nikeytas/Videomae Crime Detector Production V1

This model is a fine-tuned version of MCG-NJU/videomae-base, trained on a subset of the UCF Crime dataset for event-based binary classification. It achieves the following results on the validation set:

Loss: 0.8070
Accuracy: 0.6250
Precision: 0.6351
Recall: 0.6250
F1 Score: 0.6114

🎯 Model Overview

This VideoMAE model has been fine-tuned for binary violence detection in video content. The model classifies videos into two categories:

Violent Crime (1): Videos containing violent criminal activities
Non-Violent Incident (0): Videos with non-violent or normal activities

The model is based on the VideoMAE architecture and has been specifically trained on a curated subset of the UCF Crime dataset with event-based categorization for realistic crime detection scenarios.

📊 Dataset & Training

Dataset Composition

Total Videos: 300

Violent Crime Videos: 150
Non-Violent Incident Videos: 150

Class Balance: 50.0% violent crimes

Event Distribution:

Abuse: 34 videos
Arrest: 36 videos
Arson: 46 videos
Assault: 36 videos
Burglary: 70 videos
Explosion: 24 videos
Fighting: 30 videos
RoadAccidents: 86 videos
Robbery: 98 videos
Shoplifting: 36 videos
Stealing: 62 videos

Data Splits:

Training: 192 videos
Validation: 48 videos
Test: 60 videos
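
The card does not say how these splits were produced. As a minimal sketch, assuming a stratified random split with a 64/16/20 ratio (which reproduces the 192/48/60 counts from 150 videos per class), the procedure could look like the following. The function name `stratified_split`, the seed, and the placeholder file names are hypothetical.

```python
import random

def stratified_split(items_by_class, train_frac=0.64, val_frac=0.16, seed=42):
    """Split each class separately so every split keeps the class balance."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_frac)
        n_val = round(len(shuffled) * val_frac)
        splits["train"] += [(item, label) for item in shuffled[:n_train]]
        splits["val"] += [(item, label) for item in shuffled[n_train:n_train + n_val]]
        splits["test"] += [(item, label) for item in shuffled[n_train + n_val:]]
    return splits

# Hypothetical placeholder IDs: 150 videos per class, as in the card.
videos = {
    1: [f"violent_{i:03d}.mp4" for i in range(150)],
    0: [f"nonviolent_{i:03d}.mp4" for i in range(150)],
}
splits = stratified_split(videos)
print({name: len(rows) for name, rows in splits.items()})
# {'train': 192, 'val': 48, 'test': 60}
```

Splitting per class rather than over the pooled list keeps each split at the same 50/50 balance as the full dataset.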

🎯 Performance

Performance Metrics

Validation Performance:

eval_loss: 0.8070
eval_accuracy: 0.6250
eval_precision: 0.6351
eval_recall: 0.6250
eval_f1: 0.6114
eval_runtime: 6.4319
eval_samples_per_second: 7.4630
eval_steps_per_second: 3.7310
epoch: 10.0000

Test Performance:

eval_loss: 0.6541
eval_accuracy: 0.6667
eval_precision: 0.6667
eval_recall: 0.6667
eval_f1: 0.6667
eval_runtime: 8.0508
eval_samples_per_second: 7.4530
eval_steps_per_second: 3.7260
epoch: 10.0000
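
In both splits the reported recall equals the accuracy. That is what support-weighted averaging over the two classes always produces, so the metrics were probably computed with weighted averaging; the card does not state this, so treat it as an inference. The sketch below computes weighted metrics from a confusion matrix. The matrix itself is hypothetical: it is one matrix consistent with the test-set numbers under the assumption of 30 test videos per class, not one reported by the card.

```python
def weighted_metrics(confusion):
    """confusion[t][p] = number of samples with true class t predicted as p."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])  # true samples of class c
        predicted = sum(confusion[t][c] for t in range(n_classes))
        p = confusion[c][c] / predicted if predicted else 0.0
        r = confusion[c][c] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion matrix: rows = true class (0, 1), columns = predicted class.
hypothetical_test_confusion = [[20, 10],
                               [10, 20]]
metrics = weighted_metrics(hypothetical_test_confusion)
print({name: round(value, 4) for name, value in metrics.items()})
# {'accuracy': 0.6667, 'precision': 0.6667, 'recall': 0.6667, 'f1': 0.6667}
```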

Training Information:

Training Time: 19.8 minutes
Best Accuracy Achieved: 0.6667
Model Architecture: VideoMAE Base (fine-tuned)
Fine-tuning Approach: Event-based binary classification

🚀 Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

Learning Rate: 5e-05
Train Batch Size: 2
Eval Batch Size: 2
Optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
LR Scheduler Type: Linear
Training Epochs: 10
Weight Decay: 0.01
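
To make the linear scheduler concrete, here is a rough sketch of how the learning rate would evolve over the run. It assumes a single device, no gradient accumulation, and no warmup, none of which the card states; with 192 training videos and a batch size of 2 that gives 96 optimizer steps per epoch. The helper name `linear_lr` is illustrative.

```python
import math

# Values from the card.
LEARNING_RATE = 5e-5
TRAIN_BATCH_SIZE = 2
NUM_EPOCHS = 10
NUM_TRAIN_VIDEOS = 192

# Assumptions (not stated in the card): one device, no gradient accumulation, no warmup.
steps_per_epoch = math.ceil(NUM_TRAIN_VIDEOS / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

def linear_lr(step, base_lr=LEARNING_RATE, total=total_steps, warmup=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))

print("total optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 5e-5, is halved by the midpoint of training, and reaches zero at the final step.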

Training Results

Training Loss	Epoch	Step	Validation Loss	Accuracy
0.6667	10.00	N/A	0.8070	0.6250

Framework Versions

Transformers: 4.30.2+
PyTorch: 2.0.1+
Datasets: Latest
Device: Apple Silicon MPS / CUDA / CPU (Auto-detected)
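
The card says the device is auto-detected but does not show how. A minimal sketch of one way to do it is below; the priority order (CUDA, then MPS, then CPU) and the helper names `pick_device` and `detect_device` are assumptions, not the author's code.

```python
def pick_device(cuda_available, mps_available):
    """Return a device string, preferring CUDA, then Apple Silicon MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

def detect_device():
    """Query PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent in older PyTorch releases, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)

device = detect_device()
print(f"Using device: {device}")
```

With a model and processed inputs loaded, both would then be moved to the chosen device with `model.to(device)` and `{k: v.to(device) for k, v in inputs.items()}`.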

🚀 Quick Start

Installation

pip install transformers torch torchvision opencv-python pillow

Basic Usage

import torch
from transformers import AutoModelForVideoClassification, AutoProcessor
import cv2
import numpy as np

# Load model and processor
model = AutoModelForVideoClassification.from_pretrained("Nikeytas/videomae-crime-detector-production-v1")
processor = AutoProcessor.from_pretrained("Nikeytas/videomae-crime-detector-production-v1")

# Process video
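def classify_video(video_path, num_frames=16):
    # Extract num_frames evenly spaced frames from the video
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError(f"Could not open video: {video_path}")

    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total_frames <= 0:
        cap.release()
        raise ValueError(f"Video reports no frames: {video_path}")
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes to BGR; the processor expects RGB
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    cap.release()

    if not frames:
        raise ValueError(f"No frames could be decoded from: {video_path}")
    # The model expects exactly num_frames frames; repeat the last
    # decoded frame if some reads failed
    while len(frames) < num_frames:
        frames.append(frames[-1])

    # Process with model
    inputs = processor(frames, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()

    label = "Violent Crime" if predicted_class == 1 else "Non-Violent"
    return label, confidence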

# Example usage
video_path = "path/to/your/video.mp4"
prediction, confidence = classify_video(video_path)
print(f"Prediction: {prediction} (Confidence: {confidence:.3f})")

Batch Processing

from pathlib import Path

def process_video_directory(video_dir, output_file="results.txt"):
    results = []
    
    for video_file in Path(video_dir).glob("*.mp4"):
        try:
            prediction, confidence = classify_video(str(video_file))
            results.append({
                "file": video_file.name,
                "prediction": prediction,
                "confidence": confidence
            })
            print(f"✅ {video_file.name}: {prediction} ({confidence:.3f})")
        except Exception as e:
            print(f"❌ Error processing {video_file.name}: {e}")
    
    # Save results
    with open(output_file, "w") as f:
        for result in results:
            f.write(f"{result['file']}: {result['prediction']} ({result['confidence']:.3f})\n")
    
    return results

# Process all videos in a directory
results = process_video_directory("./videos/")

📈 Technical Specifications

Base Model: MCG-NJU/videomae-base
Architecture: Vision Transformer (ViT) adapted for video
Input Resolution: 224x224 pixels per frame
Temporal Resolution: 16 frames per video clip
Output Classes: 2 (Binary classification)
Training Framework: HuggingFace Transformers
Optimization: AdamW optimizer with learning rate 5e-5
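
To illustrate what these input dimensions mean for the transformer, the short calculation below derives the input tensor shape and the number of tokens per clip. The patch size of 16, tubelet size of 2, and 3 color channels are the defaults of the VideoMAE base configuration as I understand them; the card does not restate them, so treat them as assumptions.

```python
# Values from the card.
NUM_FRAMES = 16
IMAGE_SIZE = 224

# Assumed defaults of the VideoMAE base configuration (not restated in the card).
PATCH_SIZE = 16
TUBELET_SIZE = 2
NUM_CHANNELS = 3

# Shape of pixel_values for one clip: (batch, frames, channels, height, width).
pixel_values_shape = (1, NUM_FRAMES, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)

patches_per_frame = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 14 * 14 = 196
temporal_slices = NUM_FRAMES // TUBELET_SIZE         # 16 / 2 = 8
num_tokens = patches_per_frame * temporal_slices     # 196 * 8 = 1568

print(pixel_values_shape, num_tokens)
# (1, 16, 3, 224, 224) 1568
```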

⚠️ Limitations

Dataset Scope: Trained on a 300-video subset of the UCF Crime dataset, so it may not generalize to all types of violence
Temporal Context: Uses 16-frame clips which may miss context in longer sequences
Environmental Bias: Performance may vary with different lighting, camera angles, and video quality
False Positives: May misclassify intense but non-violent activities (sports, action movies)
Real-time Performance: Processing time depends on hardware capabilities
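
For the temporal-context limitation, one common workaround is to score several overlapping 16-frame windows and combine the scores. The sketch below is not part of the released model: the helper names `clip_windows` and `aggregate`, the stride of 8, and the choice of max or mean aggregation are illustrative assumptions, and the per-window probabilities are made-up stand-ins for model outputs. Note that it uses consecutive frames per window, unlike the Quick Start code, which spreads 16 frames over the whole video.

```python
def clip_windows(total_frames, clip_len=16, stride=8):
    """Return lists of frame indices for overlapping fixed-length windows."""
    if total_frames < clip_len:
        # Too short for one full window: pad by repeating the last frame index.
        last = max(total_frames - 1, 0)
        return [[min(i, last) for i in range(clip_len)]]
    starts = list(range(0, total_frames - clip_len + 1, stride))
    if starts[-1] != total_frames - clip_len:
        starts.append(total_frames - clip_len)  # make sure the tail is covered
    return [list(range(s, s + clip_len)) for s in starts]

def aggregate(violent_probs, how="max"):
    """Combine per-window probabilities of the violent class into one score."""
    if how == "max":
        return max(violent_probs)
    return sum(violent_probs) / len(violent_probs)

windows = clip_windows(total_frames=40)
print([(w[0], w[-1]) for w in windows])
# [(0, 15), (8, 23), (16, 31), (24, 39)]

# Hypothetical per-window probabilities, standing in for model outputs.
print(aggregate([0.2, 0.7, 0.4, 0.3], how="max"))
# 0.7
```

Max aggregation flags a video if any window looks violent, which favors recall; mean aggregation is more conservative. Which one suits a given deployment would need to be validated on held-out data.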

🔒 Ethical Considerations

Intended Use

Primary: Research and development in video analysis
Secondary: Security system enhancement with human oversight
Educational: Computer vision and AI safety research

Prohibited Uses

Surveillance without consent: Do not use for unauthorized monitoring
Discriminatory profiling: Avoid bias against specific groups or communities
Automated punishment: Never use for automated legal or disciplinary actions
Privacy violation: Respect privacy laws and individual rights

Bias and Fairness

Model trained on specific dataset that may not represent all populations
Regular evaluation needed for bias detection and mitigation
Human oversight required for critical applications
Consider demographic representation in deployment scenarios

📝 Model Card Information

Developed by: Research Team
Model Type: Video Classification (Binary)
Training Data: UCF Crime Dataset (Subset)
Training Date: 2025-06-01 23:46:55 UTC
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score
Intended Users: Researchers, Security Professionals, Developers

📚 Citation

If you use this model in your research, please cite:

@misc{Nikeytas_videomae_crime_detector_production_v1,
    title={VideoMAE Fine-tuned for Crime Detection},
    author={Research Team},
    year={2025},
    publisher={Hugging Face},
    url={https://huggingface.co/Nikeytas/videomae-crime-detector-production-v1}
}

🤝 Contributing

We welcome contributions to improve the model! Please:

Report issues with specific examples
Suggest improvements for bias reduction
Share evaluation results on new datasets
Contribute to documentation and examples

📞 Contact

For questions, issues, or collaboration opportunities, please open an issue in the model repository or contact the development team.

Last updated: 2025-06-01 23:46:55 UTC
Model version: 1.0
Framework: HuggingFace Transformers
