# **Image Captioning Models with ResNet50+LSTM, ViT+BERT, and ViT+GPT2**
This repository contains the implementation of three advanced image captioning models:
1. **ResNet50 + LSTM**: A classic approach using a ResNet50 convolutional neural network (CNN) for image encoding and an LSTM for sequential caption generation.
2. **Vision Transformer (ViT) + BERT**: A transformer-based approach leveraging Vision Transformers (ViT) for image encoding and BERT for text generation.
3. **Vision Transformer (ViT) + GPT2**: A generative model combining ViT for image encoding with GPT2’s autoregressive capabilities for text generation.
Each model pairs a visual encoder with a language decoder to generate descriptive captions for input images.
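All three decoders generate captions autoregressively: conditioned on the encoded image, they emit one token at a time until an end-of-sequence token appears. A minimal greedy-decoding sketch of that loop (the `step_fn`, `<s>`, and `</s>` names are illustrative stand-ins, not this repository's actual API):

```python
def greedy_decode(step_fn, start_token="<s>", end_token="</s>", max_len=20):
    """Greedy autoregressive decoding: repeatedly score the next token
    given the tokens emitted so far, and keep the most likely one."""
    tokens = [start_token]
    for _ in range(max_len):
        scores = step_fn(tokens)           # {token: score} for the next position
        next_token = max(scores, key=scores.get)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]                      # drop the start token
```

In the real models, `step_fn` would run the LSTM, BERT, or GPT2 decoder over the image features plus the partial caption; beam search is a common drop-in replacement for the greedy `max`.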
---
## **Data**
This project uses the [COCO dataset](https://cocodataset.org/) for training and evaluation, which consists of images with multiple human-annotated captions.
---
## **Hyperparameters**
The following table summarizes the key training configurations used for each model:
| **Parameter** | **ResNet50 + LSTM** | **ViT + BERT** | **ViT + GPT2** |
|-------------------|---------------------|-----------------|-----------------|
| **Epochs** | 10 | 10 | 10 |
| **Batch Size** | 128 | 32 | 32 |
| **Learning Rate** | 0.0001 | 0.00001 | 0.00001 |
| **Optimizer** | Adam | Adam | Adam |
| **Scheduler** | N/A | OneCycleLR | OneCycleLR |
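The OneCycleLR schedule used for the two transformer models warms the learning rate up to its peak and then anneals it down over training. A simplified pure-Python sketch of the schedule's shape, loosely mirroring the defaults of `torch.optim.lr_scheduler.OneCycleLR` (cosine phases, `pct_start=0.3`, `div_factor=25`); the actual training code uses the PyTorch scheduler itself:

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-5, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` under a simplified one-cycle policy:
    cosine warm-up from max_lr/div_factor to max_lr over the first
    pct_start of training, then cosine decay down to a tiny final rate."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        t = step / warmup_steps
        return initial_lr + (max_lr - initial_lr) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr + (min_lr - max_lr) * (1 - math.cos(math.pi * t)) / 2
```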
---
## **Evaluation Results**
The models were evaluated using standard image captioning metrics: **BLEU-1** through **BLEU-4**, **METEOR**, and **ROUGE-L**. The table below reports each model's scores:
| **Model** | **BLEU-1** | **BLEU-2** | **BLEU-3** | **BLEU-4** | **METEOR** | **ROUGE-L** |
|--------------------|------------|------------|------------|------------|------------|-------------|
| **ResNet50 + LSTM**| 0.648 | 0.451 | 0.300 | 0.202 | 0.421 | 0.506 |
| **ViT + BERT** | 0.725 | **0.551** | **0.395** | **0.278** | 0.501 | **0.546** |
| **ViT + GPT2** | **0.728** | 0.545 | 0.385 | 0.265 | **0.502** | 0.532 |
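For intuition, BLEU-1 is a clipped unigram precision times a brevity penalty that discourages overly short captions. A self-contained sketch (illustrative only; the scores above come from standard evaluation tooling):

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Clipped unigram precision times a brevity penalty (BLEU-1)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    # Clip each candidate word's count by its max count in any reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(count, max_ref[word])
                  for word, count in Counter(cand).items())
    precision = clipped / max(len(cand), 1)
    # Brevity penalty against the reference length closest to the candidate's.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision
```

BLEU-2 through BLEU-4 extend the same idea to bigram, trigram, and 4-gram precision, combined as a geometric mean.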
---
## **Inference Example**
Below is an example of how the models perform on a given image. The table shows the reference caption and the predicted captions generated by each model.
<table>
<tr>
<th>Image</th>
<th>Reference Caption</th>
<th>Predicted Caption</th>
</tr>
<tr>
<td>
<img src="examples/000000166391.jpg" alt="Traffic light" width="300">
</td>
<td>
<ol>
<li>Traffic is stopped at a red stop light.</li>
<li>Cars are stopped at a traffic light on a highway.</li>
<li>A number of red and green traffic lights on a wide highway.</li>
<li>A large and wide street covered in lots of traffic lights.</li>
<li>A traffic light and intersection with cars traveling in both directions on the street.</li>
</ol>
</td>
<td>
<b>ResNet50 + LSTM:</b> a traffic light with a street sign on it.<br>
<b>ViT + BERT:</b> a bunch of traffic lights hanging from a wire.<br>
<b>ViT + GPT2:</b> A green traffic light hanging over a street.
</td>
</tr>
</table>