---
language: en
tags:
- deepfake-detection
- computer-vision
- ensemble-learning
- pytorch
- vision-transformer
- cnn
- image-classification
datasets:
- Hemg/deepfake-and-real-images
metrics:
- accuracy
- precision
- recall
- f1
library_name: pytorch
pipeline_tag: image-classification
license: mit
model-index:
- name: CNN-ViT-Ensemble-Deepfake-Detector
  results:
  - task:
      type: image-classification
      name: Deepfake Detection
    dataset:
      type: Hemg/deepfake-and-real-images
      name: Deepfake and Real Images Dataset
    metrics:
    - type: accuracy
      value: 94.87
      name: Test Accuracy
    - type: f1
      value: 0.94
      name: F1 Score
    - type: precision
      value: 0.95
      name: Precision
    - type: recall
      value: 0.94
      name: Recall
widget:
- text: image-classification
---
# Ensemble-Based Deep Learning Architecture for Deepfake Detection
## Abstract
This research presents a novel ensemble-based approach for detecting deepfake images using a combination of Convolutional Neural Networks (CNNs) and Vision Transformers (ViT). The system achieves 94.87% accuracy by leveraging three complementary architectures: a 12-layer CNN, a lightweight 6-layer CNN, and a hybrid CNN-ViT model. Our approach demonstrates robust performance in distinguishing between real and manipulated facial images.
## 1. Introduction
With the increasing sophistication of deepfake technology, detecting manipulated images has become crucial for maintaining digital media integrity. This work introduces an ensemble method that combines traditional CNN architectures with modern Vision Transformers to create a robust detection system.
## 2. Architecture
### 2.1 Model Components
The system consists of three distinct models (a hedged PyTorch sketch follows this list):
1. **Model A (12-layer CNN)**
- Three convolutional blocks
- Each block: 2 conv layers + BatchNorm + ReLU + pooling
- Input size: 50x50 pixels
- Dropout rate: 0.3
2. **Model B (6-layer CNN)**
- Lightweight architecture
- Three simple conv layers with pooling
- Input size: 50x50 pixels
- Dropout rate: 0.3
3. **Model C (CNN-ViT Hybrid)**
- CNN feature extractor
- Vision Transformer (base-16 architecture)
- Input size: 224x224 pixels
- Pretrained ViT backbone
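For concreteness, here is a minimal PyTorch sketch of the three components as described in the list above. The channel widths, the exact placement of dropout, the `vit_base_patch16_224` backbone name, and the two-class heads are assumptions based on these descriptions rather than the released training code.

```python
import torch.nn as nn
import timm  # provides the pretrained ViT backbone


def conv_block(in_ch, out_ch):
    # One Model A block: 2 conv layers + BatchNorm + ReLU, followed by pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.MaxPool2d(2),
    )


class ModelA(nn.Module):
    """12-layer CNN: three conv blocks on 50x50 inputs (channel widths assumed)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(conv_block(3, 32), conv_block(32, 64), conv_block(64, 128))
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.3),
            nn.Linear(128 * 6 * 6, num_classes),  # 50 -> 25 -> 12 -> 6 after three poolings
        )

    def forward(self, x):
        return self.classifier(self.features(x))


class ModelB(nn.Module):
    """Lightweight 6-layer CNN: three single conv layers with pooling on 50x50 inputs."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Dropout(0.3), nn.Linear(64 * 6 * 6, num_classes))

    def forward(self, x):
        return self.classifier(self.features(x))


class ModelC(nn.Module):
    """CNN feature extractor feeding a pretrained ViT-Base/16 on 224x224 inputs."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Shallow CNN that preserves the 3x224x224 shape so the ViT patch embedding still applies
        self.cnn = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.ReLU())
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=num_classes)

    def forward(self, x):
        return self.vit(self.cnn(x))
```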
### 2.2 Ensemble Strategy
The final prediction is determined through majority voting among the three models, enhancing robustness and reducing individual model biases.
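A minimal sketch of that hard-voting step, assuming each model returns class logits and that each model receives a copy of the batch already resized to its expected input size:

```python
import torch


@torch.no_grad()
def ensemble_predict(models, batches):
    """Hard majority vote over per-model class predictions.

    `batches` holds one appropriately resized copy of the input batch per model,
    since Models A/B expect 50x50 images while Model C expects 224x224.
    """
    votes = []
    for model, x in zip(models, batches):
        model.eval()
        votes.append(model(x).argmax(dim=1))  # per-model hard prediction
    votes = torch.stack(votes, dim=0)         # shape: (n_models, batch_size)
    # With three voters, the mode along the model axis is the majority class
    return torch.mode(votes, dim=0).values
```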
## 3. Implementation Details
### 3.1 Dataset
- Dataset: Hemg/deepfake-and-real-images
- Split: 80% training, 20% testing
- Data augmentation: resize and normalization (see the data-loading sketch below)
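A hedged sketch of this data pipeline, assuming the dataset is pulled from the Hugging Face Hub with the `datasets` library and that the column names (`image`, `label`) and normalization constants match the actual preprocessing:

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torchvision import transforms

# 50x50 pipeline for Models A/B; Model C repeats the same steps with Resize((224, 224))
tfm_small = transforms.Compose([
    transforms.Resize((50, 50)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # normalization constants assumed
])

ds = load_dataset("Hemg/deepfake-and-real-images")
# If the hosted dataset exposes a single split, carve out the 80/20 split manually
splits = ds["train"].train_test_split(test_size=0.2, seed=42)


def collate(batch):
    images = torch.stack([tfm_small(ex["image"].convert("RGB")) for ex in batch])
    labels = torch.tensor([ex["label"] for ex in batch])
    return images, labels


train_loader = DataLoader(splits["train"], batch_size=32, shuffle=True, collate_fn=collate)
test_loader = DataLoader(splits["test"], batch_size=32, collate_fn=collate)
```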
### 3.2 Training Parameters
- Optimizer: Adam
- Learning rate: 1e-4
- Batch size: 32
- Epochs: 10
- Loss function: Cross-Entropy (see the training-loop sketch below)
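A minimal training-loop sketch wiring together the hyperparameters above; each of the three models is trained independently with the same loop (the `model` and `train_loader` objects come from the earlier sketches):

```python
import torch
import torch.nn as nn


def train(model, train_loader, epochs=10, lr=1e-4,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        running_loss, correct, total = 0.0, 0, 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(images)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
        print(f"epoch {epoch + 1}: loss={running_loss / total:.4f} acc={correct / total:.4f}")
```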
## 4. Results
### 4.1 Performance Metrics
Based on the test set evaluation (see the evaluation sketch after the report):
- **Overall Accuracy**: 94.87%
- **Classification Report**:
  - Real images: precision 0.95, recall 0.94, F1-score 0.94
  - Fake images: precision 0.94, recall 0.95, F1-score 0.95
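These per-class figures are the kind of output scikit-learn's `classification_report` produces. A hedged evaluation sketch, assuming a 0 = Fake / 1 = Real label encoding and reusing the test loader from Section 3.1 (the same loop works for a single model or for the ensemble wrapped behind the majority-vote function in Section 2.2):

```python
import torch
from sklearn.metrics import accuracy_score, classification_report


@torch.no_grad()
def evaluate(model, test_loader, device="cpu"):
    """Collect predictions over the test split and print the metrics reported above."""
    model.to(device).eval()
    y_true, y_pred = [], []
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1)
        y_true.extend(labels.tolist())
        y_pred.extend(preds.cpu().tolist())
    print(f"Accuracy: {accuracy_score(y_true, y_pred) * 100:.2f}%")
    # The 0 = Fake, 1 = Real ordering is an assumption about the dataset's label encoding
    print(classification_report(y_true, y_pred, target_names=["Fake", "Real"]))
```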
### 4.2 Deployment
The system is deployed as a FastAPI service, providing real-time inference with confidence scores.
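A hedged sketch of what such a service can look like; the endpoint path matches Section 6.1, while `load_ensemble`, the `file` form-field name, the label encoding, and the confidence computation (averaged softmax) are illustrative assumptions rather than the deployed code.

```python
import io

import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import transforms

app = FastAPI()
models = load_ensemble()  # hypothetical helper that restores the three trained models
# One resize per branch: Models A/B take 50x50 inputs, Model C takes 224x224
tfms = [transforms.Compose([transforms.Resize(size), transforms.ToTensor()])
        for size in [(50, 50), (50, 50), (224, 224)]]


@app.post("/predict/")
async def predict(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    with torch.no_grad():
        # Averaged softmax gives a confidence score; the final label in the released
        # system comes from majority voting (Section 2.2)
        probs = torch.stack([m(t(image).unsqueeze(0)).softmax(dim=1)
                             for m, t in zip(models, tfms)]).mean(dim=0)
    confidence, idx = probs.max(dim=1)
    label = "Real" if idx.item() == 1 else "Fake"  # label order assumed
    return {"prediction": label, "confidence": f"{confidence.item() * 100:.2f}%"}
```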
### 4.3 Visuals

*Figure 1: Findings of the proposed ensemble-based deepfake detection system*
#### Performance Visualization

*Figure 2: Confusion matrix showing model performance on test set*
#### Loss vs. Epochs

*Figure 3: Loss vs Epochs for individual models*
#### Accuracy vs. Epochs

*Figure 4: Accuracy vs Epochs for individual models*
## 5. Technical Requirements
- Python 3.x
- PyTorch
- timm
- FastAPI
- PIL
- scikit-learn
## 6. Usage
### 6.1 API Endpoint
```text
POST /predict/
Input:  an image file (multipart form upload)
Output: {
  "prediction": "Real" | "Fake",
  "confidence": "<percentage>"
}
```
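For illustration, a request can be sent to a locally running instance like this; the URL and the `file` form-field name are assumptions consistent with the deployment sketch in Section 4.2:

```python
import requests

# Hypothetical local URL; adjust to wherever the service is deployed
with open("sample_face.jpg", "rb") as f:
    response = requests.post("http://127.0.0.1:8000/predict/", files={"file": f})
print(response.json())  # {"prediction": "Real" or "Fake", "confidence": "..."}
```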
### 6.2 Model Training
Run the `cnn-vit` file to train the three models on your own custom dataset.
## 7. Conclusions
The ensemble approach demonstrates superior performance in deepfake detection, with the combination of traditional CNNs and modern Vision Transformers providing robust and reliable results. The system's high accuracy and balanced precision-recall metrics make it suitable for real-world applications. Although Model C's performance dips at epoch 10, the overall ensemble results remain strong.
## 8. Future Work
- Integration of attention mechanisms in CNN models
- Exploration of different ensemble strategies
- Extension to video deepfake detection
- Investigation of model compression techniques
## References
1. Vision Transformer (ViT) - Dosovitskiy et al., 2020
2. timm library - Ross Wightman
3. FastAPI - Sebastián Ramírez
## License
MIT License