Image Captioning Models with ResNet50+LSTM, ViT+BERT, and ViT+GPT2

This repository contains implementations of three image captioning models:

ResNet50 + LSTM: A classic approach using Convolutional Neural Networks (CNNs) for image encoding and LSTMs for sequential caption generation.
Vision Transformer (ViT) + BERT: A transformer-based approach leveraging Vision Transformers (ViT) for image encoding and BERT for text generation.
Vision Transformer (ViT) + GPT2: A generative model combining ViT for image encoding with GPT2’s autoregressive capabilities for text generation.

Each model pairs a visual encoder, which turns the input image into features, with a language decoder that generates a descriptive caption from those features.
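The encoder-decoder pattern shared by the three models can be illustrated with a greedy decoding loop. This is a minimal sketch and not the repository's code: `greedy_decode`, `next_token_scores`, and `toy_decoder` are hypothetical names, the image features are a placeholder list, and the toy decoder simply replays a fixed caption so the loop can run without a trained model.

```python
from typing import Callable, Dict, List, Sequence


def greedy_decode(
    image_features: Sequence[float],
    next_token_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    bos: str = "<bos>",
    eos: str = "<eos>",
    max_len: int = 20,
) -> List[str]:
    """Generate a caption one token at a time, always taking the top-scoring token."""
    tokens = [bos]
    for _ in range(max_len):
        # The decoder sees the encoded image and every token generated so far.
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start-of-sequence marker


def toy_decoder(features: Sequence[float], prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for an LSTM, BERT, or GPT2 decoder: replays a fixed caption."""
    script = ["a", "traffic", "light", "<eos>"]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in set(script)}


caption = greedy_decode([0.1, 0.2, 0.3], toy_decoder)
print(" ".join(caption))
```

In a real model the scoring function would be the decoder's forward pass, and beam search or sampling could replace the greedy choice.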

Data

This project trains and evaluates on the COCO dataset, in which each image has multiple human-annotated captions.

Hyperparameters

The following table summarizes the key training configurations used for each model:

Parameter	ResNet50 + LSTM	ViT + BERT	ViT + GPT2
Epochs	10	10	10
Batch Size	128	32	32
Learning Rate	0.0001	0.00001	0.00001
Optimizer	Adam	Adam	Adam
Scheduler	N/A	OneCycleLR	OneCycleLR
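As a rough illustration, the table can be captured as a small configuration object, together with a simplified one-cycle learning-rate curve. This is a sketch under assumptions: `TrainConfig`, `CONFIGS`, and `one_cycle_lr` are hypothetical names, the schedule is a simplified approximation rather than PyTorch's exact `OneCycleLR`, its default arguments are chosen to resemble PyTorch's defaults, and treating the table's learning rate as the schedule's peak value is a guess that the source does not confirm.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    epochs: int
    batch_size: int
    learning_rate: float
    optimizer: str
    scheduler: Optional[str]  # None where the table says N/A


# Values copied from the hyperparameter table.
CONFIGS = {
    "ResNet50 + LSTM": TrainConfig(10, 128, 1e-4, "Adam", None),
    "ViT + BERT": TrainConfig(10, 32, 1e-5, "Adam", "OneCycleLR"),
    "ViT + GPT2": TrainConfig(10, 32, 1e-5, "Adam", "OneCycleLR"),
}


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Simplified one-cycle schedule: cosine warm-up to max_lr, then cosine decay."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))
    if step < warmup_steps:
        start, end, frac = initial_lr, max_lr, step / warmup_steps
    else:
        decay_steps = max(1, total_steps - 1 - warmup_steps)
        start, end = max_lr, min_lr
        frac = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine interpolation from start (frac = 0) to end (frac = 1).
    return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2


cfg = CONFIGS["ViT + GPT2"]
# Assumption: the table's learning rate is used as the peak of the cycle.
for step in (0, 30, 99):
    print(step, one_cycle_lr(step, total_steps=100, max_lr=cfg.learning_rate))
```

The learning rate starts low, peaks after the warm-up fraction, and decays toward a very small value by the final step.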

Evaluation Results

The models were evaluated with standard image captioning metrics: BLEU-1 through BLEU-4, METEOR, and ROUGE-L. The table below lists the scores for each model:

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L
ResNet50 + LSTM	0.648	0.451	0.300	0.202	0.421	0.506
ViT + BERT	0.725	0.551	0.395	0.278	0.501	0.546
ViT + GPT2	0.728	0.545	0.385	0.265	0.502	0.532
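To make the BLEU columns concrete, here is a minimal sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty. It is an illustration only: the source does not say which implementation or tokenizer produced the scores above, reported scores are typically computed over the whole evaluation set rather than per sentence, and the helper names `tokenize`, `ngrams`, and `bleu` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; real evaluations may tokenize differently.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    cand = tokenize(candidate)
    refs = [tokenize(r) for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda L: (abs(L - len(cand)), L))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


references = [
    "Traffic is stopped at a red stop light.",
    "Cars are stopped at a traffic light on a highway.",
]
prediction = "A green traffic light hanging over a street."
print(bleu(prediction, references, max_n=1), bleu(prediction, references, max_n=2))
```

Higher-order BLEU drops quickly on short captions because a single missing 4-gram match zeroes the unsmoothed score, which is one reason corpus-level aggregation or smoothing is commonly used.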

Inference Example

The example below compares each model's predicted caption with the human-written reference captions for a single image.

Reference captions:
Traffic is stopped at a red stop light.
Cars are stopped at a traffic light on a highway.
A number of red and green traffic lights on a wide highway.
A large and wide street covered in lots of traffic lights.
A traffic light and intersection with cars traveling in both directions on the street.

Model	Predicted Caption
ResNet50 + LSTM	a traffic light with a street sign on it.
ViT + BERT	a bunch of traffic lights hanging from a wire.
ViT + GPT2	A green traffic light hanging over a street.
