Image Captioning Models with ResNet50+LSTM, ViT+BERT, and ViT+GPT2
This repository contains the implementation of three advanced image captioning models:
- ResNet50 + LSTM: A classic approach using Convolutional Neural Networks (CNNs) for image encoding and LSTMs for sequential caption generation.
- Vision Transformer (ViT) + BERT: A transformer-based approach leveraging Vision Transformers (ViT) for image encoding and BERT for text generation.
- Vision Transformer (ViT) + GPT2: A generative model combining ViT for image encoding with GPT2's autoregressive capabilities for text generation.
Each model integrates a robust visual encoder and a natural language processing decoder to generate descriptive captions for input images.
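To illustrate how an autoregressive decoder turns encoder output into a caption, the sketch below runs greedy decoding against a stand-in scoring function. It is a minimal illustration, not the repository's code: `greedy_decode`, `toy_scores`, and the token names are all hypothetical, and a real decoder would score tokens from the image features rather than from a fixed script.

```python
from typing import Callable, Dict, List

BOS, EOS = "<bos>", "<eos>"


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Append the highest-scoring next token until EOS or max_len is reached."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the BOS marker


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a decoder conditioned on image features."""
    script = ["a", "green", "traffic", "light", EOS]
    position = len(prefix) - 1  # number of tokens generated so far
    target = script[min(position, len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in set(script)}


caption = " ".join(greedy_decode(toy_scores))
print(caption)
```

Greedy decoding is the simplest strategy; beam search or sampling could replace the `max` step without changing the surrounding loop.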
Data
This project uses the COCO dataset, which consists of images with multiple human-annotated captions each, for training and evaluation.
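Because each image has several reference captions, a common first step is to group them by image. The sketch below assumes the standard COCO captions annotation layout (an `images` list and an `annotations` list linked by `image_id`); the `group_captions` helper and the sample entries are made up for illustration and are not part of this repository. With a real annotation file, the dictionary would come from `json.load`.

```python
from collections import defaultdict
from typing import Dict, List


def group_captions(coco: dict) -> Dict[str, List[str]]:
    """Map each image file name to the list of its reference captions."""
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[id_to_file[ann["image_id"]]].append(ann["caption"].strip())
    return dict(grouped)


# Tiny made-up sample in the COCO captions layout (not real dataset entries).
sample = {
    "images": [
        {"id": 1, "file_name": "example_1.jpg"},
        {"id": 2, "file_name": "example_2.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "A traffic light over a street. "},
        {"id": 11, "image_id": 1, "caption": "A green light at an intersection."},
        {"id": 12, "image_id": 2, "caption": "A dog running on a beach."},
    ],
}

captions_by_image = group_captions(sample)
```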
Hyperparameters
The following table summarizes the key training configurations used for each model:
Parameter | ResNet50 + LSTM | ViT + BERT | ViT + GPT2 |
---|---|---|---|
Epochs | 10 | 10 | 10 |
Batch Size | 128 | 32 | 32 |
Learning Rate | 0.0001 | 0.00001 | 0.00001 |
Optimizer | Adam | Adam | Adam |
Scheduler | N/A | OneCycleLR | OneCycleLR |
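The table above could be captured in code along these lines. This is only a sketch: `TrainConfig`, `CONFIGS`, and the dictionary keys are hypothetical names, and the sample count passed to `total_steps` is a placeholder rather than the actual training-set size. The step count matters because a one-cycle schedule needs to know the length of the run up front.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    epochs: int
    batch_size: int
    learning_rate: float
    optimizer: str
    scheduler: Optional[str]  # None where the table lists N/A

    def total_steps(self, num_samples: int) -> int:
        """Optimizer steps over the whole run, assuming one step per batch."""
        return self.epochs * math.ceil(num_samples / self.batch_size)


# Values copied from the hyperparameter table; the keys are illustrative.
CONFIGS = {
    "resnet50_lstm": TrainConfig(10, 128, 1e-4, "Adam", None),
    "vit_bert": TrainConfig(10, 32, 1e-5, "Adam", "OneCycleLR"),
    "vit_gpt2": TrainConfig(10, 32, 1e-5, "Adam", "OneCycleLR"),
}

# Placeholder dataset size, used only to show the arithmetic.
steps = CONFIGS["vit_bert"].total_steps(num_samples=1000)
```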
Evaluation Results
The models were evaluated using popular metrics for image captioning: BLEU (1-4), METEOR, and ROUGE-L. The table below provides the performance scores for each model:
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
---|---|---|---|---|---|---|
ResNet50 + LSTM | 0.648 | 0.451 | 0.300 | 0.202 | 0.421 | 0.506 |
ViT + BERT | 0.725 | 0.551 | 0.395 | 0.278 | 0.501 | 0.546 |
ViT + GPT2 | 0.728 | 0.545 | 0.385 | 0.265 | 0.502 | 0.532 |
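For readers unfamiliar with BLEU, the sketch below computes a sentence-level BLEU-n score from clipped n-gram precisions and a brevity penalty. It is a simplified illustration with uniform weights and no smoothing, not the evaluation code behind the table above, which most likely relied on a standard corpus-level implementation. The example captions are made up.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref: Counter = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# Made-up captions for illustration only.
reference = "a green traffic light over a street".split()
prediction = "a traffic light on a street".split()
bleu1 = bleu(prediction, [reference], max_n=1)
```

BLEU-1 through BLEU-4 correspond to `max_n` values of 1 through 4, so higher orders reward longer matching phrases and are typically lower, as in the table.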
Inference Example
Below is an example of how the models perform on a given image. The table shows the reference caption and the predicted captions generated by each model.
Image | Reference Caption | Predicted Caption |
---|---|---|
![]() | | ResNet50 + LSTM: a traffic light with a street sign on it.<br>ViT + BERT: a bunch of traffic lights hanging from a wire.<br>ViT + GPT2: A green traffic light hanging over a street. |