# **Image Captioning Models with ResNet50+LSTM, ViT+BERT, and ViT+GPT2**

This repository contains the implementation of three image captioning models:
1. **ResNet50 + LSTM**: A classic approach using Convolutional Neural Networks (CNNs) for image encoding and LSTMs for sequential caption generation.
2. **Vision Transformer (ViT) + BERT**: A transformer-based approach leveraging Vision Transformers (ViT) for image encoding and BERT for text generation.
3. **Vision Transformer (ViT) + GPT2**: A generative model combining ViT for image encoding with GPT2’s autoregressive capabilities for text generation.

Each model pairs a visual encoder with a language-model decoder to generate descriptive captions for input images.
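
For reference, here is a minimal sketch of how such an encoder-decoder pair can be wired up with Hugging Face `transformers`. The checkpoint names are illustrative assumptions, not the exact weights used in this repository:

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

# Illustrative ViT+GPT2 pairing; the public checkpoints below are assumptions.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT2 has no pad token by default; reuse EOS so batching and generation work.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```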

---

## **Data**

This project uses the [COCO dataset](https://cocodataset.org/) for training and evaluation; each image is paired with multiple human-annotated captions.
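
As a sketch, the caption annotations can be loaded with `torchvision`'s `CocoCaptions` dataset. The local paths below are assumptions; adjust them to wherever COCO is downloaded:

```python
from torchvision import transforms
from torchvision.datasets import CocoCaptions

# Assumed local layout of a COCO 2017 download; requires pycocotools.
train_ds = CocoCaptions(
    root="data/coco/train2017",
    annFile="data/coco/annotations/captions_train2017.json",
    transform=transforms.Compose([
        transforms.Resize((224, 224)),  # ViT and ResNet50 both expect 224x224 inputs
        transforms.ToTensor(),
    ]),
)

image, captions = train_ds[0]  # `captions` is a list of ~5 reference strings
```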

---

## **Hyperparameters**

The following table summarizes the key training configurations used for each model:

| **Parameter**    | **ResNet50 + LSTM** | **ViT + BERT**  | **ViT + GPT2**  |
|-------------------|---------------------|-----------------|-----------------|
| **Epochs**       | 10                  | 10              | 10              |
| **Batch Size**    | 128                 | 32              | 32              |
| **Learning Rate** | 0.0001              | 0.00001         | 0.00001         |
| **Optimizer**     | Adam                | Adam            | Adam            |
| **Scheduler**     | N/A                 | OneCycleLR      | OneCycleLR      |
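
As a sketch of how the ViT-based configuration above translates into a training loop (the `model` and `train_loader` objects are assumed to exist):

```python
import torch

EPOCHS, LR = 10, 1e-5  # ViT+BERT / ViT+GPT2 settings from the table above

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=LR, epochs=EPOCHS, steps_per_epoch=len(train_loader)
)

for epoch in range(EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # HF models return a loss when labels are passed
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped per batch, not per epoch
```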

---

## **Evaluation Results**

The models were evaluated with standard image-captioning metrics: **BLEU-1** through **BLEU-4**, **METEOR**, and **ROUGE-L**. The table below reports the scores for each model; bold entries mark the best result per metric:

| **Model**         | **BLEU-1** | **BLEU-2** | **BLEU-3** | **BLEU-4** | **METEOR** | **ROUGE-L** |
|--------------------|------------|------------|------------|------------|------------|-------------|
| **ResNet50 + LSTM**| 0.648      | 0.451      | 0.300      | 0.202      | 0.421      | 0.506       |
| **ViT + BERT**     | 0.725      | **0.551**  | **0.395**  | **0.278**  | 0.501      | **0.546**   |
| **ViT + GPT2**     | **0.728**  | 0.545      | 0.385      | 0.265      | **0.502**  | 0.532       |
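
A hedged sketch of how such scores can be computed with `nltk` and the `rouge-score` package; the `refs`/`hyps` inputs are assumed to be produced by the models, and METEOR can be computed analogously with `nltk.translate.meteor_score`:

```python
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer

def caption_scores(refs, hyps):
    """refs: per-image lists of tokenized reference captions;
    hyps: tokenized predicted captions (assumed variables)."""
    scores = {
        "BLEU-1": corpus_bleu(refs, hyps, weights=(1, 0, 0, 0)),
        "BLEU-4": corpus_bleu(refs, hyps, weights=(0.25,) * 4),
    }
    rs = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    # rouge-score compares raw strings pairwise; take the best reference per image.
    per_image = [
        max(rs.score(" ".join(r), " ".join(h))["rougeL"].fmeasure for r in ref_list)
        for ref_list, h in zip(refs, hyps)
    ]
    scores["ROUGE-L"] = sum(per_image) / len(per_image)
    return scores
```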

---

## **Inference Example**

Below is an example of how the models perform on a sample image. The table shows the reference captions and the caption predicted by each model.

<table>
  <tr>
    <th>Image</th>
    <th>Reference Captions</th>
    <th>Predicted Captions</th>
  </tr>
  <tr>
    <td>
      <img src="examples/000000166391.jpg" alt="Traffic light" width="300">
    </td>
    <td>
      <ol>
        <li>Traffic is stopped at a red stop light.</li>
        <li>Cars are stopped at a traffic light on a highway.</li>
        <li>A number of red and green traffic lights on a wide highway.</li>
        <li>A large and wide street covered in lots of traffic lights.</li>
        <li>A traffic light and intersection with cars traveling in both directions on the street.</li>
      </ol>
    </td>
    <td>
      <b>ResNet50 + LSTM:</b> a traffic light with a street sign on it.<br>
      <b>ViT + BERT:</b> a bunch of traffic lights hanging from a wire.<br>
      <b>ViT + GPT2:</b> A green traffic light hanging over a street.
    </td>
  </tr>
</table>
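
For completeness, a minimal inference sketch for the ViT+GPT2 model, assuming the `model`, `processor`, and `tokenizer` objects set up earlier have been trained:

```python
import torch
from PIL import Image

image = Image.open("examples/000000166391.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)

caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)  # e.g. "A green traffic light hanging over a street."
```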