Image Captioning Models with ResNet50+LSTM, ViT+BERT, and ViT+GPT2

This repository contains implementations of three image captioning models:

  1. ResNet50 + LSTM: A classic approach using Convolutional Neural Networks (CNNs) for image encoding and LSTMs for sequential caption generation.
  2. Vision Transformer (ViT) + BERT: A transformer-based approach leveraging Vision Transformers (ViT) for image encoding and BERT for text generation.
  3. Vision Transformer (ViT) + GPT2: A generative model combining ViT for image encoding with GPT2's autoregressive capabilities for text generation.

Each model pairs a visual encoder with a language decoder to generate descriptive captions for input images.
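To make the encoder/decoder split concrete, here is a minimal sketch of greedy autoregressive decoding, the simplest way a decoder can turn image features into a caption. It is not taken from this repository: `greedy_decode`, `next_token_scores`, `toy_decoder`, and the `<bos>`/`<eos>` tokens are hypothetical names, and the toy decoder only replays a fixed caption in place of a trained LSTM, BERT, or GPT2 decoder.

```python
from typing import Callable, Dict, List, Sequence


def greedy_decode(
    image_features: Sequence[float],
    next_token_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 20,
    bos: str = "<bos>",
    eos: str = "<eos>",
) -> List[str]:
    """Generate a caption one token at a time, always taking the top-scoring token."""
    tokens = [bos]
    for _ in range(max_len):
        # The decoder sees the image features plus everything generated so far.
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token


def toy_decoder(image_features, tokens):
    """Hypothetical stand-in for a trained decoder: ignores the image, replays a script."""
    script = ["a", "traffic", "light", "<eos>"]
    target = script[min(len(tokens) - 1, len(script) - 1)]
    return {word: (1.0 if word == target else 0.0) for word in script}


caption = greedy_decode([0.0] * 4, toy_decoder)
print(" ".join(caption))
```

In a real model, `next_token_scores` would be the decoder's forward pass conditioned on the encoder output; beam search or sampling could replace the `max` step.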


Data

This project uses the COCO dataset, which consists of images with multiple human-annotated captions, for training and evaluation.
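COCO caption annotation files are JSON documents with an `images` list and an `annotations` list, where each annotation links a `caption` to an `image_id`. The sketch below groups captions by image file name. It is an illustration rather than this repository's data loader: `group_captions` is a hypothetical helper, and the `sample` dictionary is a tiny in-memory stand-in for a real annotation file.

```python
from collections import defaultdict


def group_captions(coco: dict) -> dict:
    """Map each image file name to the list of its captions."""
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    captions = defaultdict(list)
    for ann in coco["annotations"]:
        captions[id_to_file[ann["image_id"]]].append(ann["caption"].strip())
    return dict(captions)


# Hypothetical sample in the COCO captions layout (not real dataset entries).
sample = {
    "images": [{"id": 1, "file_name": "000000000001.jpg"}],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "Traffic is stopped at a red stop light."},
        {"image_id": 1, "id": 11, "caption": "Cars are stopped at a traffic light on a highway."},
    ],
}

grouped = group_captions(sample)
for file_name, caps in grouped.items():
    print(file_name, len(caps))
```

With a real annotation file, `sample` would be replaced by the result of `json.load` on that file.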


Hyperparameters

The following table summarizes the key training configurations used for each model:

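| Parameter     | ResNet50 + LSTM | ViT + BERT | ViT + GPT2 |
|---------------|-----------------|------------|------------|
| Epochs        | 10              | 10         | 10         |
| Batch Size    | 128             | 32         | 32         |
| Learning Rate | 0.0001          | 0.00001    | 0.00001    |
| Optimizer     | Adam            | Adam       | Adam       |
| Scheduler     | N/A             | OneCycleLR | OneCycleLR |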
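As an illustration, the values in the table above can be written as a configuration dictionary, together with a simplified pure-Python one-cycle schedule (warm up to a peak learning rate, then decay). Several things here are assumptions: that the listed learning rate is the peak of the cycle, the `pct_start`, `div_factor`, and `final_div_factor` values, and the step count. None of these are stated by the repository, and `one_cycle_lr` is a hypothetical helper, not the scheduler implementation the authors used.

```python
import math

# Values copied from the hyperparameter table above.
CONFIGS = {
    "ResNet50 + LSTM": {"epochs": 10, "batch_size": 128, "lr": 1e-4, "scheduler": None},
    "ViT + BERT": {"epochs": 10, "batch_size": 32, "lr": 1e-5, "scheduler": "OneCycleLR"},
    "ViT + GPT2": {"epochs": 10, "batch_size": 32, "lr": 1e-5, "scheduler": "OneCycleLR"},
}


def one_cycle_lr(step, total_steps, max_lr,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Simplified one-cycle schedule: cosine warm-up to max_lr, then cosine decay."""
    initial_lr = max_lr / div_factor        # assumed starting learning rate
    min_lr = initial_lr / final_div_factor  # assumed final learning rate
    warmup_steps = max(1, int(pct_start * total_steps))
    if step < warmup_steps:
        start, end = initial_lr, max_lr
        pct = step / warmup_steps
    else:
        start, end = max_lr, min_lr
        pct = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    # Cosine interpolation from start (pct = 0) to end (pct = 1).
    return end + (start - end) * (1 + math.cos(math.pi * pct)) / 2


cfg = CONFIGS["ViT + BERT"]
total = 1000  # hypothetical number of optimizer steps
schedule = [one_cycle_lr(s, total, cfg["lr"]) for s in range(total)]
print(f"start={schedule[0]:.2e} peak={max(schedule):.2e} end={schedule[-1]:.2e}")
```

Under these assumptions the learning rate starts well below the table value, peaks at it after 30% of the steps, and then decays to near zero.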

Evaluation Results

The models were evaluated using popular metrics for image captioning: BLEU (1-4), METEOR, and ROUGE-L. The table below provides the performance scores for each model:

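| Model           | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|-----------------|--------|--------|--------|--------|--------|---------|
| ResNet50 + LSTM | 0.648  | 0.451  | 0.300  | 0.202  | 0.421  | 0.506   |
| ViT + BERT      | 0.725  | 0.551  | 0.395  | 0.278  | 0.501  | 0.546   |
| ViT + GPT2      | 0.728  | 0.545  | 0.385  | 0.265  | 0.502  | 0.532   |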
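For readers unfamiliar with BLEU, the following sketch computes a sentence-level BLEU-n score (clipped n-gram precision with a brevity penalty, uniform weights, no smoothing) for one caption against several references. It is not the evaluation code behind the table above, which is not shown in this README; the tokenizer is a naive assumption, and a single-sentence score is not comparable to scores aggregated over a full evaluation set. The captions are taken from the inference example in this README.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram counts at most as often
    as it appears in the single reference that contains it most."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()


references = [tokenize(t) for t in [
    "Traffic is stopped at a red stop light.",
    "Cars are stopped at a traffic light on a highway.",
    "A number of red and green traffic lights on a wide highway.",
    "A large and wide street covered in lots of traffic lights.",
    "A traffic light and intersection with cars traveling in both directions on the street.",
]]
candidate = tokenize("A green traffic light hanging over a street.")

# 6 of the 8 candidate unigrams are matched by the references, so BLEU-1 is 0.75.
print(round(bleu(candidate, references, 1), 3))
```

Only "hanging" and "over" are missing from the references, which is why the unigram precision is 6/8; higher-order scores drop quickly because fewer multi-word sequences match.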

Inference Example

Below is an example of how the models perform on a single image. It lists the five reference captions and the caption predicted by each model.

Image: Traffic light

Reference captions:
  1. Traffic is stopped at a red stop light.
  2. Cars are stopped at a traffic light on a highway.
  3. A number of red and green traffic lights on a wide highway.
  4. A large and wide street covered in lots of traffic lights.
  5. A traffic light and intersection with cars traveling in both directions on the street.

Predicted captions:
  ResNet50 + LSTM: a traffic light with a street sign on it.
  ViT + BERT: a bunch of traffic lights hanging from a wire.
  ViT + GPT2: A green traffic light hanging over a street.