# **Image Captioning Models with ResNet50+LSTM, ViT+BERT, and ViT+GPT2**

This repository contains implementations of three image captioning models:
1. **ResNet50 + LSTM**: A classic approach that uses a convolutional neural network (CNN) for image encoding and an LSTM for sequential caption generation.
2. **Vision Transformer (ViT) + BERT**: A transformer-based approach that uses a Vision Transformer (ViT) for image encoding and BERT for text generation.
3. **Vision Transformer (ViT) + GPT2**: A generative model that combines ViT for image encoding with GPT2’s autoregressive text generation.

Each model pairs a visual encoder with a language decoder to generate descriptive captions for input images.

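
As a rough illustration of the encoder-decoder flow described above, the following toy sketch shows greedy autoregressive decoding: the decoder receives the image features plus the tokens generated so far and emits one token at a time until an end marker appears. All names here (`greedy_decode`, `toy_scorer`, the `<bos>`/`<eos>` markers) are hypothetical and not taken from this repository, and the scorer is a stand-in for a trained LSTM, BERT, or GPT2 decoder. Whether the repository decodes greedily or with another strategy such as beam search is not stated.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: maps (image features, tokens so far) to next-token scores.
NextTokenScorer = Callable[[Sequence[float], List[str]], Dict[str, float]]


def greedy_decode(
    score_next: NextTokenScorer,
    image_features: Sequence[float],
    bos: str = "<bos>",
    eos: str = "<eos>",
    max_len: int = 20,
) -> List[str]:
    """Generate a caption token by token, always taking the top-scoring token."""
    tokens = [bos]
    for _ in range(max_len):
        scores = score_next(image_features, tokens)
        next_token = max(scores, key=scores.get)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


def toy_scorer(image_features, tokens):
    """Stand-in for a trained decoder: replays a fixed caption, then emits <eos>."""
    caption = ["a", "dog", "runs"]
    position = len(tokens) - 1  # number of caption tokens produced so far
    target = caption[position] if position < len(caption) else "<eos>"
    return {word: (1.0 if word == target else 0.0) for word in caption + ["<eos>"]}


print(greedy_decode(toy_scorer, image_features=[0.1, 0.2]))
```

In a real model, `score_next` would run the decoder conditioned on the encoder output, and the tokens would be vocabulary IDs rather than strings.
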

---

## **Hyperparameters**

The following table summarizes the key training configurations used for each model:

| **Parameter**     | **ResNet50 + LSTM** | **ViT + BERT**  | **ViT + GPT2**  |
|-------------------|---------------------|-----------------|-----------------|
| **Epochs**        | 10                  | 10              | 10              |
| **Batch Size**    | 128                 | 32              | 32              |
| **Learning Rate** | 0.0001              | 0.00001         | 0.00001         |
| **Optimizer**     | Adam                | Adam            | Adam            |
| **Scheduler**     | N/A                 | OneCycleLR      | OneCycleLR      |
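
One way to keep these settings in code is a small configuration object per model. The sketch below only restates the table values. The class name `TrainConfig`, the field names, and the dictionary keys are illustrative assumptions and not part of this repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Training settings for one model (field names are illustrative)."""
    epochs: int
    batch_size: int
    learning_rate: float
    optimizer: str
    scheduler: Optional[str]  # None where the table says N/A


# Values copied from the hyperparameter table; the keys are hypothetical names.
CONFIGS = {
    "resnet50_lstm": TrainConfig(10, 128, 1e-4, "Adam", None),
    "vit_bert": TrainConfig(10, 32, 1e-5, "Adam", "OneCycleLR"),
    "vit_gpt2": TrainConfig(10, 32, 1e-5, "Adam", "OneCycleLR"),
}

for name, cfg in CONFIGS.items():
    print(name, cfg)
```

The table does not say whether the listed learning rate is the starting rate or the peak rate of the OneCycleLR schedule, so the sketch stores it as a single value without interpreting it.
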

---

## **Evaluation Results**

The models were evaluated with standard image captioning metrics: **BLEU-1** through **BLEU-4**, **METEOR**, and **ROUGE-L**. The table below reports the scores for each model; bold marks the best score in each column.

| **Model**           | **BLEU-1** | **BLEU-2** | **BLEU-3** | **BLEU-4** | **METEOR** | **ROUGE-L** |
|---------------------|------------|------------|------------|------------|------------|-------------|
| **ResNet50 + LSTM** | 0.648      | 0.451      | 0.300      | 0.202      | 0.421      | 0.506       |
| **ViT + BERT**      | 0.725      | **0.551**  | **0.395**  | **0.278**  | 0.501      | **0.546**   |
| **ViT + GPT2**      | **0.728**  | 0.545      | 0.385      | 0.265      | **0.502**  | 0.532       |
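
For readers unfamiliar with BLEU-n, the following is a minimal sentence-level sketch of the metric: clipped n-gram precision for n = 1..N, combined by a geometric mean and scaled by a brevity penalty. It is not the code that produced the scores above. The README does not state which implementation or dataset was used, and reported scores are usually computed at corpus level with a library. The function names are illustrative, and no smoothing is applied.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], references: List[List[str]], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed BLEU-max_n for one candidate caption."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


candidate = "a dog runs on the grass".split()
references = ["a dog is running on the grass".split()]
for n in (1, 2):
    print(f"BLEU-{n}: {sentence_bleu(candidate, references, max_n=n):.3f}")
```

Because higher-order n-grams are harder to match, scores fall from BLEU-1 to BLEU-4, which is the pattern visible in every row of the table.
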

---