---
license: mit
tags:
- video-classification
- I3D
- action-recognition
- anomaly-detection
datasets:
- kinetics-400
- ucf-crime
model-index:
- name: i3d_ucf_finetuned
  results:
  - task:
      type: video-classification
    dataset:
      name: UCF-Crime
      type: ucf-crime
    metrics:
    - name: Validation Accuracy
      type: accuracy
      value: 0.6667
---
# I3D UCF Finetuned

## Model Description

This is a finetuned I3D (Inflated 3D ConvNet) model for video classification, based on the `i3d_r50` architecture from PyTorchVideo. The I3D model uses a ResNet-50 backbone whose 2D convolutions are inflated to 3D, capturing both spatial and temporal features from videos. It was originally pretrained on the Kinetics-400 dataset, which contains ~306,245 short videos across 400 human action classes (e.g., running, dancing, cooking).
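For orientation, here is a minimal sketch (not part of this repository's training code) that loads the pretrained `i3d_r50` backbone from the PyTorchVideo hub and sanity-checks its input/output shapes; the 32-frame, 224×224 clip shape matches the preprocessing used in the Usage section below.

```python
import torch

# Load the pretrained I3D backbone from the PyTorchVideo hub
# (the same call used by the finetuning wrapper below).
backbone = torch.hub.load('facebookresearch/pytorchvideo', 'i3d_r50', pretrained=True)
backbone.eval()

# I3D expects clips shaped (batch, channels, time, height, width).
# A 32-frame, 224x224 RGB clip matches the frame extraction used below.
clip = torch.randn(1, 3, 32, 224, 224)
with torch.no_grad():
    logits = backbone(clip)
print(logits.shape)  # torch.Size([1, 400]) -- one logit per Kinetics-400 class
```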
The model was finetuned on the UCF-Crime dataset to classify videos into 8 specific categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`. During finetuning, the final fully connected layer was modified to output 8 classes, and a Dropout layer (p=0.3) was added to reduce overfitting. The finetuned weights are stored in `i3d_ucf_finetuned.pth` (109 MB) and can be downloaded from this repository.
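As a quick illustration of that head change (the same swap appears inside the wrapper class in the Usage section), the modification amounts to:

```python
import torch
import torch.nn as nn

backbone = torch.hub.load('facebookresearch/pytorchvideo', 'i3d_r50', pretrained=True)

# In PyTorchVideo's i3d_r50, blocks[6] is the classification head;
# its `proj` layer maps 2048 pooled features to the class logits.
print(backbone.blocks[6].proj)  # Linear(in_features=2048, out_features=400, bias=True)

# Replace the 400-way Kinetics head with an 8-way UCF-Crime head.
backbone.blocks[6].proj = nn.Linear(2048, 8)
```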
## Dataset

### Pretraining Dataset

- **Kinetics-400**: A large-scale dataset with ~306,245 videos covering 400 human action classes. It provides robust general features for video understanding, making it an excellent starting point for finetuning.
### Finetuning Dataset

- **UCF-Crime**: A dataset for anomaly detection in videos, containing 1,900 videos (1,610 for training, 290 for testing). The model was finetuned on a subset of UCF-Crime to classify videos into 8 categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`. A loading sketch follows below.
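This repository does not ship a data loader. As a rough sketch, assuming the finetuning subset is arranged in one folder per class (a hypothetical layout, not documented by this repo), clips can be paired with label indices like so:

```python
import os

# Hypothetical layout: ucf_crime_subset/<class_name>/<video>.mp4
# -- the actual organization of the finetuning subset is not
# documented in this repository.
labels = ["arrest", "Explosion", "Fight", "normal",
          "roadaccidents", "shooting", "Stealing", "vandalism"]

def list_videos(root="ucf_crime_subset"):
    samples = []
    for idx, name in enumerate(labels):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.endswith(".mp4"):
                samples.append((os.path.join(class_dir, fname), idx))
    return samples  # list of (video_path, label_index) pairs
```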
## Performance

The model was finetuned for 30 epochs. Below are the training and validation performance plots:

### Training and Validation Accuracy

- Best Validation Accuracy: ~66.67% (achieved after finetuning on UCF-Crime).
- Training Accuracy: reached ~81.03%.

### Training and Validation Loss

- The training loss decreases steadily, while the validation loss fluctuates, indicating room to improve generalization.
## Usage

To use the model for video classification, you can load the weights from this repository using the following code:
```python
import torch
import torch.nn as nn
import cv2
import numpy as np
from huggingface_hub import hf_hub_download

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define and load the model
def load_i3d_ucf_finetuned(repo_id="Ahmeddawood0001/i3d_ucf_finetuned", filename="i3d_ucf_finetuned.pth"):
    class I3DClassifier(nn.Module):
        def __init__(self, num_classes):
            super(I3DClassifier, self).__init__()
            # Pretrained I3D backbone from PyTorchVideo
            self.i3d = torch.hub.load('facebookresearch/pytorchvideo', 'i3d_r50', pretrained=True)
            self.dropout = nn.Dropout(0.3)
            # Replace the 400-way Kinetics head with an 8-way head
            self.i3d.blocks[6].proj = nn.Linear(2048, num_classes)

        def forward(self, x):
            x = self.i3d(x)
            x = self.dropout(x)
            return x

    model = I3DClassifier(num_classes=8).to(device)
    weights_path = hf_hub_download(repo_id=repo_id, filename=filename)
    # map_location lets the checkpoint load on CPU-only machines too
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval()
    return model

# Extract up to max_frames RGB frames, padding short videos by repeating the last frame
def extract_frames(video_path, max_frames=32, frame_size=(224, 224)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame = cv2.resize(frame, frame_size)
        frames.append(frame)
    cap.release()
    if not frames:
        raise ValueError(f"No frames could be read from {video_path}")
    while len(frames) < max_frames:
        frames.append(frames[-1])
    frames = np.stack(frames)
    # (T, H, W, C) -> (T, C, H, W), scaled to [0, 1]
    frames = torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.0
    # (T, C, H, W) -> (C, T, H, W), the layout I3D expects
    frames = frames.permute(1, 0, 2, 3)
    return frames

# Classify a single video and return the top label with its confidence
def classify_video(video_path, model, labels):
    frames = extract_frames(video_path)
    frames = frames.unsqueeze(0).to(device)  # add batch dimension
    with torch.no_grad():
        outputs = model(frames)
        probabilities = torch.softmax(outputs, dim=1)
        predicted_idx = torch.argmax(probabilities, dim=1).item()
    predicted_label = labels[predicted_idx]
    confidence = probabilities[0, predicted_idx].item()
    return predicted_label, confidence

# Example usage
labels = ["arrest", "Explosion", "Fight", "normal", "roadaccidents", "shooting", "Stealing", "vandalism"]
model = load_i3d_ucf_finetuned()
video_path = "path/to/your/video.mp4"  # Replace with your video path
predicted_label, confidence = classify_video(video_path, model, labels)
print(f"Video: {video_path}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```