---
license: mit
datasets:
- abdallahwagih/ucf101-videos
metrics:
- accuracy
base_model:
- google/mobilenet_v2_1.0_224
pipeline_tag: video-classification

tags:
- action-recognition
- cnn-gru
- video-classification
- ucf101
- action
- mobilenetv2
- deep-learning
- pytorch
---

# Action Detection with CNN-GRU on MobileNetV2

## Overview

This model performs human action classification on videos using a CNN-GRU architecture built on top of **MobileNetV2 (1.0, 224)** features and trained on the [UCF101](https://www.kaggle.com/datasets/abdallahwagih/ucf101-videos) dataset.  
It is well suited to recognizing actions in short, trimmed video clips.

***

## Model Details

- **Base model:** `google/mobilenet_v2_1.0_224`
- **Architecture:** CNN-GRU (a minimal PyTorch sketch follows this list)

  ![CNN-GRU Architecture](./cnn_architecture.png)

- **Dataset:** [UCF101 - Action Recognition Dataset](https://www.kaggle.com/datasets/abdallahwagih/ucf101-videos)
- **Task:** Video Classification (Action Recognition)
- **Metrics:** Accuracy
- **License:** MIT
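
As a rough illustration of the CNN-GRU design above: MobileNetV2 features are pooled per frame, fed through a GRU, and the final hidden state is classified. The hidden size and the ImageNet-initialized backbone in this sketch are assumptions for illustration only; the `action_model` module and the diagram above define the actual trained model.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNGRU(nn.Module):
    """MobileNetV2 feature extractor followed by a GRU over the frame sequence."""

    def __init__(self, num_classes=5, hidden_size=256):  # hidden_size is an assumption
        super().__init__()
        backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
        self.features = backbone.features            # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)          # (B*T, 1280, 1, 1)
        self.gru = nn.GRU(1280, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                        # clips: (B, T, C, H, W)
        b, t, c, h, w = clips.shape
        x = self.features(clips.view(b * t, c, h, w))
        x = self.pool(x).flatten(1).view(b, t, -1)   # (B, T, 1280) per-frame features
        _, h_n = self.gru(x)                         # final hidden state
        return self.fc(h_n[-1])                      # (B, num_classes) logits
```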

***

## Usage

### Requirements

```bash
pip install torch torchvision opencv-python
```

### Example Code

```python
from action_model import load_action_model, preprocess_frames, predict_action
import cv2

# Load model
model = load_action_model(model_path="best_model.pt", device="cpu", num_classes=5)

# Read frames from video
cap = cv2.VideoCapture("path_to_video.mp4")
frames = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
cap.release()

# Preprocess the first 16 frames for model input (see the sampling note below)
clip_tensor = preprocess_frames(frames[:16], seq_len=16, resize=(112, 112))

# Predict action
result = predict_action(model, clip_tensor, device="cpu")
print(result)
```
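
For videos longer than a second or two, using only the first 16 frames can miss the action entirely. Sampling 16 frames uniformly across the whole clip is a common alternative; the helper below is a hypothetical sketch (not part of `action_model`) that assumes `frames` holds every decoded frame, as in the example above.

```python
import numpy as np

def sample_uniform(frames, seq_len=16):
    """Pick seq_len frames spread evenly over the clip.

    Repeats the last frame when the clip is shorter than seq_len.
    Illustrative helper; not provided by action_model.
    """
    if len(frames) >= seq_len:
        idx = np.linspace(0, len(frames) - 1, seq_len).astype(int)
        return [frames[i] for i in idx]
    return frames + [frames[-1]] * (seq_len - len(frames))

clip_tensor = preprocess_frames(sample_uniform(frames), seq_len=16, resize=(112, 112))
```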

***

## Training & Evaluation

- Trained on UCF101 (split 1) with a MobileNetV2 backbone.
- Sequence length: 16 frames per clip.
- Metric: Top-1 classification accuracy.
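
For reference, top-1 accuracy can be computed with a short evaluation loop like the sketch below, assuming a `DataLoader` that yields `(clips, labels)` batches matching the model's input shape:

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    """Fraction of clips whose argmax prediction matches the label."""
    model.eval()
    correct = total = 0
    for clips, labels in loader:
        logits = model(clips.to(device))
        correct += (logits.argmax(dim=1) == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total
```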

***

## Intended Use & Limitations

**Intended for:**
- Video analytics
- Educational research
- Baseline for video action recognition tasks

**Limitations:**
- Predicts only UCF101 subset classes
- Needs short, trimmed video clips
- Not robust to out-of-domain videos or very low-resolution input

***

## Tags

`action` · `cnn-gru` · `video-classification` · `ucf101` · `mobilenetv2` · `deep-learning` · `pytorch`