Nikeytas/Videomae Crime Detector Production V1
This model is a fine-tuned version of MCG-NJU/videomae-base, trained on a subset of the UCF Crime dataset for event-based binary classification. It achieves the following results on the validation set:
- Loss: 0.8070
- Accuracy: 0.6250
- Precision: 0.6351
- Recall: 0.6250
- F1 Score: 0.6114
Model Overview
This VideoMAE model has been fine-tuned for binary violence detection in video content. The model classifies videos into two categories:
- Violent Crime (1): Videos containing violent criminal activities
- Non-Violent Incident (0): Videos with non-violent or normal activities
The model is based on the VideoMAE architecture and has been specifically trained on a curated subset of the UCF Crime dataset with event-based categorization for realistic crime detection scenarios.
Dataset & Training
Dataset Composition
Total Videos: 300
- Violent Crime Videos: 150
- Non-Violent Incident Videos: 150
Class Balance: 50.0% violent crimes
Event Distribution:
- Abuse: 34 videos
- Arrest: 36 videos
- Arson: 46 videos
- Assault: 36 videos
- Burglary: 70 videos
- Explosion: 24 videos
- Fighting: 30 videos
- RoadAccidents: 86 videos
- Robbery: 98 videos
- Shoplifting: 36 videos
- Stealing: 62 videos
Data Splits:
- Training: 192 videos
- Validation: 48 videos
- Test: 60 videos
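The split sizes above correspond to holding out 20% of the 300 videos for testing (60) and then 20% of the remaining 240 for validation (48), which leaves 192 for training. The card does not say how the split was actually produced, so the following is only a sketch of one stratified procedure that reproduces those counts. The function name `stratified_split`, the fractions, and the seed are assumptions made for illustration.

```python
import random

def stratified_split(labels, test_frac=0.2, val_frac=0.2, seed=42):
    """Split sample indices into train/val/test while preserving class balance.

    Hypothetical helper: the fractions and seed are illustrative assumptions,
    not values taken from the model card.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        # The validation fraction is taken from what remains after the test split
        n_val = round((len(idx) - n_test) * val_frac)
        test += idx[:n_test]
        val += idx[n_test:n_test + n_val]
        train += idx[n_test + n_val:]
    return train, val, test

# 150 non-violent (0) and 150 violent (1) videos, as listed above
labels = [0] * 150 + [1] * 150
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 192 48 60
```

Splitting per class keeps each subset at the same 50/50 balance as the full set, which matters for a dataset this small.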
Performance
Performance Metrics
Validation Performance:
- eval_loss: 0.8070
- eval_accuracy: 0.6250
- eval_precision: 0.6351
- eval_recall: 0.6250
- eval_f1: 0.6114
- eval_runtime: 6.4319
- eval_samples_per_second: 7.4630
- eval_steps_per_second: 3.7310
- epoch: 10.0000
Test Performance:
- eval_loss: 0.6541
- eval_accuracy: 0.6667
- eval_precision: 0.6667
- eval_recall: 0.6667
- eval_f1: 0.6667
- eval_runtime: 8.0508
- eval_samples_per_second: 7.4530
- eval_steps_per_second: 3.7260
- epoch: 10.0000
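In both splits the reported recall equals the reported accuracy. That is the pattern produced by support-weighted averaging of per-class metrics, although the card does not state which averaging mode was used, so treat it as an assumption. The sketch below computes weighted precision, recall, and F1 from a binary confusion matrix. The matrix passed in is hypothetical: it is one set of counts consistent with the test numbers above, not a published result.

```python
def weighted_metrics(tn, fp, fn, tp):
    """Return accuracy and support-weighted precision, recall, and F1."""
    total = tn + fp + fn + tp
    # Per class: (correctly predicted, number predicted, number actually present)
    per_class = [
        (tn, tn + fn, tn + fp),  # class 0: Non-Violent Incident
        (tp, tp + fp, tp + fn),  # class 1: Violent Crime
    ]
    precision = recall = f1 = 0.0
    for correct, predicted, actual in per_class:
        p = correct / predicted if predicted else 0.0
        r = correct / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = actual / total  # class support as a fraction of all samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = (tn + tp) / total
    return accuracy, precision, recall, f1

# Hypothetical confusion matrix for a 60-video test split (30 videos per class)
acc, prec, rec, f1 = weighted_metrics(tn=20, fp=10, fn=10, tp=20)
print(f"{acc:.4f} {prec:.4f} {rec:.4f} {f1:.4f}")  # 0.6667 0.6667 0.6667 0.6667
```

Weighted recall always equals accuracy, because each class's recall is multiplied by its share of the samples and the products sum to the fraction of correct predictions.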
Training Information:
- Training Time: 19.8 minutes
- Best Accuracy Achieved: 0.6667
- Model Architecture: VideoMAE Base (fine-tuned)
- Fine-tuning Approach: Event-based binary classification
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 5e-05
- Train Batch Size: 2
- Eval Batch Size: 2
- Optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- LR Scheduler Type: Linear
- Training Epochs: 10
- Weight Decay: 0.01
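To make the schedule concrete, the sketch below works out how many optimizer steps these settings imply and what a linear decay does to the learning rate over that span. It assumes no gradient accumulation, a single device, and zero warmup steps, none of which the card states. `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
import math

def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values from the card: 192 training videos, batch size 2, 10 epochs
train_videos, batch_size, epochs = 192, 2, 10
steps_per_epoch = math.ceil(train_videos / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)                 # 96 960
print(f"{linear_lr(0, total_steps):.2e}")           # 5.00e-05
print(f"{linear_lr(480, total_steps):.2e}")         # 2.50e-05
print(f"{linear_lr(total_steps, total_steps):.2e}") # 0.00e+00
```

Under these assumptions the model sees only 960 updates in total, which is worth keeping in mind when reading the metrics above.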
Training Results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---|---|---|---|---|
| 0.6667 | 10.00 | N/A | 0.8070 | 0.6250 |
Framework Versions
- Transformers: 4.30.2+
- PyTorch: 2.0.1+
- Datasets: Latest
- Device: Apple Silicon MPS / CUDA / CPU (Auto-detected)
Quick Start
Installation
pip install transformers torch torchvision opencv-python pillow
Basic Usage
import cv2
import numpy as np
import torch
from transformers import AutoModelForVideoClassification, AutoProcessor

MODEL_ID = "Nikeytas/videomae-crime-detector-production-v1"

# Load model and processor
model = AutoModelForVideoClassification.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)
model.eval()

def classify_video(video_path, num_frames=16):
    """Classify one video and return (label, confidence)."""
    # Extract num_frames evenly spaced frames
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total_frames - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes to BGR; convert to RGB for the processor
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")
    # The model expects exactly num_frames frames; repeat the last one if some reads failed
    while len(frames) < num_frames:
        frames.append(frames[-1])

    # Process with model
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()
    label = "Violent Crime" if predicted_class == 1 else "Non-Violent"
    return label, confidence

# Example usage
video_path = "path/to/your/video.mp4"
prediction, confidence = classify_video(video_path)
print(f"Prediction: {prediction} (Confidence: {confidence:.3f})")
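The frame selection in `classify_video` relies on `np.linspace` with `dtype=int` to pick evenly spaced frame indices. The dependency-free sketch below mirrors that idea so the behaviour is easy to inspect, in particular what happens when a video has fewer frames than requested. `sample_frame_indices` is a hypothetical name, and the truncation rule is an approximation of the NumPy call rather than a guaranteed bit-for-bit match.

```python
def sample_frame_indices(total_frames, num_frames=16):
    """Return num_frames indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_frames == 1:
        return [0]
    last = total_frames - 1
    # Truncate toward zero, as an integer cast of evenly spaced floats would
    return [int(i * last / (num_frames - 1)) for i in range(num_frames)]

long_video = sample_frame_indices(100)
short_video = sample_frame_indices(4)

print(long_video[0], long_video[-1], len(long_video))  # 0 99 16
print(short_video)
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3]
```

For a clip shorter than 16 frames the same frame is selected several times, so the model still receives 16 inputs but with little temporal variation.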
Batch Processing
from pathlib import Path

def process_video_directory(video_dir, output_file="results.txt"):
    """Classify every .mp4 file in video_dir using classify_video() from Basic Usage."""
    results = []
    for video_file in sorted(Path(video_dir).glob("*.mp4")):
        try:
            prediction, confidence = classify_video(str(video_file))
            results.append({
                "file": video_file.name,
                "prediction": prediction,
                "confidence": confidence
            })
            print(f"{video_file.name}: {prediction} ({confidence:.3f})")
        except Exception as e:
            print(f"Error processing {video_file.name}: {e}")
    # Save results
    with open(output_file, "w") as f:
        for result in results:
            f.write(f"{result['file']}: {result['prediction']} ({result['confidence']:.3f})\n")
    return results

# Process all videos in a directory
results = process_video_directory("./videos/")
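If the results need to be loaded into a spreadsheet or another tool, writing them as CSV may be more convenient than the free-form text file. The sketch below assumes the same list-of-dicts shape that `process_video_directory` returns. `write_results_csv` and the sample rows are hypothetical and only illustrate the format.

```python
import csv
import io

def write_results_csv(results, stream):
    """Write classification results to a CSV stream with a header row."""
    writer = csv.DictWriter(stream, fieldnames=["file", "prediction", "confidence"])
    writer.writeheader()
    for row in results:
        # Round the confidence to three decimals, matching the text output
        writer.writerow({**row, "confidence": f"{row['confidence']:.3f}"})

# Made-up rows in the shape returned by process_video_directory()
sample_results = [
    {"file": "a.mp4", "prediction": "Violent Crime", "confidence": 0.8123},
    {"file": "b.mp4", "prediction": "Non-Violent", "confidence": 0.5501},
]

buffer = io.StringIO()
write_results_csv(sample_results, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.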
Technical Specifications
- Base Model: MCG-NJU/videomae-base
- Architecture: Vision Transformer (ViT) adapted for video
- Input Resolution: 224x224 pixels per frame
- Temporal Resolution: 16 frames per video clip
- Output Classes: 2 (Binary classification)
- Training Framework: HuggingFace Transformers
- Optimization: AdamW optimizer with learning rate 5e-5
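The input and temporal resolutions above determine how many tokens the transformer processes per clip. The sketch below works through that arithmetic. The patch size of 16 and tubelet size of 2 are the usual VideoMAE base defaults and are an assumption here, since the card does not list them.

```python
def videomae_token_count(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Number of spatio-temporal tokens for one clip.

    patch_size and tubelet_size are assumed VideoMAE base defaults,
    not values stated in this model card.
    """
    patches_per_frame = (image_size // patch_size) ** 2  # 14 x 14 spatial grid
    temporal_slices = num_frames // tubelet_size         # pairs of frames form one tubelet
    return temporal_slices * patches_per_frame

# Shape of one preprocessed clip: (batch, frames, channels, height, width)
clip_shape = (1, 16, 3, 224, 224)

print(videomae_token_count())  # 1568
```

Because attention cost grows with the square of the token count, raising the frame count or resolution makes inference noticeably more expensive.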
Limitations
- Dataset Scope: Trained on a 300-video subset of the UCF Crime dataset, so it may not generalize to all types of violence
- Temporal Context: Uses 16-frame clips which may miss context in longer sequences
- Environmental Bias: Performance may vary with different lighting, camera angles, and video quality
- False Positives: May misclassify intense but non-violent activities (sports, action movies)
- Real-time Performance: Processing time depends on hardware capabilities
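One possible way to soften the Temporal Context limitation is to split a long video into several segments, classify 16 frames from each, and average the per-segment probabilities. This is not something the card describes or evaluates. It is a sketch of a common workaround, and `segment_ranges`, `average_probabilities`, and the probability values are all hypothetical.

```python
def segment_ranges(total_frames, num_segments):
    """Split [0, total_frames) into contiguous, near-equal (start, end) ranges."""
    bounds = [round(i * total_frames / num_segments) for i in range(num_segments + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]

def average_probabilities(segment_probs):
    """Average class probabilities across segments into one video-level estimate."""
    num_classes = len(segment_probs[0])
    count = len(segment_probs)
    return [sum(p[c] for p in segment_probs) / count for c in range(num_classes)]

# A 300-frame video split into three segments
print(segment_ranges(300, 3))  # [(0, 100), (100, 200), (200, 300)]

# Made-up per-segment outputs as [P(non-violent), P(violent)]
probs = [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]]
print(average_probabilities(probs))  # [0.5, 0.5]
```

Each range would be passed through the same frame sampling and model call as a full video. Averaging is only one aggregation choice; taking the maximum violent-class probability instead would favour recall over precision.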
Ethical Considerations
Intended Use
- Primary: Research and development in video analysis
- Secondary: Security system enhancement with human oversight
- Educational: Computer vision and AI safety research
Prohibited Uses
- Surveillance without consent: Do not use for unauthorized monitoring
- Discriminatory profiling: Avoid bias against specific groups or communities
- Automated punishment: Never use for automated legal or disciplinary actions
- Privacy violation: Respect privacy laws and individual rights
Bias and Fairness
- Model trained on specific dataset that may not represent all populations
- Regular evaluation needed for bias detection and mitigation
- Human oversight required for critical applications
- Consider demographic representation in deployment scenarios
Model Card Information
- Developed by: Research Team
- Model Type: Video Classification (Binary)
- Training Data: UCF Crime Dataset (Subset)
- Training Date: 2025-06-01 23:46:55 UTC
- Evaluation Metrics: Accuracy, Precision, Recall, F1-Score
- Intended Users: Researchers, Security Professionals, Developers
Citation
If you use this model in your research, please cite:
@misc{Nikeytas_videomae_crime_detector_production_v1,
title={VideoMAE Fine-tuned for Crime Detection},
author={Research Team},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/Nikeytas/videomae-crime-detector-production-v1}
}
Contributing
We welcome contributions to improve the model! Please:
- Report issues with specific examples
- Suggest improvements for bias reduction
- Share evaluation results on new datasets
- Contribute to documentation and examples
Contact
For questions, issues, or collaboration opportunities, please open an issue in the model repository or contact the development team.
Last updated: 2025-06-01 23:46:55 UTC
Model version: 1.0
Framework: HuggingFace Transformers