# **Image Captioning Models with ResNet50+LSTM, ViT+BERT, and ViT+GPT2**
|
|
|
This repository contains implementations of three image captioning models:

1. **ResNet50 + LSTM**: a classic approach that uses a convolutional neural network (CNN) for image encoding and an LSTM for sequential caption generation.

2. **Vision Transformer (ViT) + BERT**: a transformer-based approach that pairs a ViT image encoder with a BERT decoder for text generation.

3. **Vision Transformer (ViT) + GPT2**: a generative approach that combines a ViT image encoder with GPT2's autoregressive decoder for text generation.

Each model pairs a visual encoder with a natural language decoder to generate descriptive captions for input images.
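As a rough illustration (not the repository's exact construction code), the ViT + GPT2 pairing can be assembled from pretrained checkpoints with Hugging Face's `VisionEncoderDecoderModel`; the checkpoint names below are common public ones and are assumptions here. The ViT + BERT variant is built analogously by swapping in a BERT decoder checkpoint.

```python
# A minimal sketch (not the repository's exact code) of wiring a ViT encoder
# to a GPT2 decoder with Hugging Face transformers. Checkpoint names are
# common public checkpoints, assumed here for illustration.
from transformers import (
    GPT2TokenizerFast,
    ViTImageProcessor,
    VisionEncoderDecoderModel,
)

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT encoder
    "gpt2",                               # GPT2 decoder; cross-attention layers are added
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT2 has no pad token by default, so reuse EOS for padding during training.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```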
|
|
|
---
|
|
|
## **Data**
|
|
|
This project uses the [COCO dataset](https://cocodataset.org/) for training and evaluation; each image comes with multiple human-annotated reference captions.
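For reference, COCO images and their caption annotations can be loaded with torchvision's `CocoCaptions` dataset (which requires `pycocotools`); the paths below are placeholders for a local copy of the dataset, not this repository's layout.

```python
# A minimal sketch of loading COCO captions with torchvision (requires
# pycocotools). The local paths are placeholders.
from torchvision import transforms
from torchvision.datasets import CocoCaptions

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # match the encoders' input resolution
    transforms.ToTensor(),
])

train_set = CocoCaptions(
    root="data/coco/train2017",  # directory of training images
    annFile="data/coco/annotations/captions_train2017.json",
    transform=transform,
)

image, captions = train_set[0]  # one image tensor, a list of ~5 reference captions
```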
|
|
|
---
|
|
|
## **Hyperparameters**
|
|
|
The following table summarizes the key training configurations used for each model:
|
|
|
| **Parameter**     | **ResNet50 + LSTM** | **ViT + BERT** | **ViT + GPT2** |
|-------------------|---------------------|----------------|----------------|
| **Epochs**        | 10                  | 10             | 10             |
| **Batch Size**    | 128                 | 32             | 32             |
| **Learning Rate** | 0.0001              | 0.00001        | 0.00001        |
| **Optimizer**     | Adam                | Adam           | Adam           |
| **Scheduler**     | N/A                 | OneCycleLR     | OneCycleLR     |
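For the ViT-based models, the Adam + OneCycleLR combination can be set up as in the sketch below, assuming PyTorch; `model`, `train_loader`, and `compute_loss` are placeholders rather than names from this repository. Note that OneCycleLR is stepped once per optimizer update, so it needs the total number of steps up front.

```python
# A minimal sketch of the ViT-based models' optimization setup in PyTorch.
# `model`, `train_loader`, and `compute_loss` are placeholder names.
import torch

EPOCHS = 10
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-5,                        # peak learning rate of the cycle
    epochs=EPOCHS,
    steps_per_epoch=len(train_loader),  # total steps = epochs * steps_per_epoch
)

for epoch in range(EPOCHS):
    for batch in train_loader:
        loss = compute_loss(model, batch)  # forward pass + caption loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                   # OneCycleLR steps after every batch
```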
|
|
|
---
|
|
|
## **Evaluation Results**
|
|
|
The models were evaluated with standard image captioning metrics: **BLEU (1-4)**, **METEOR**, and **ROUGE-L**. The table below lists the scores for each model, with the best score per metric in bold; a sketch of how such metrics can be computed follows the table.
|
|
|
| **Model**          | **BLEU-1** | **BLEU-2** | **BLEU-3** | **BLEU-4** | **METEOR** | **ROUGE-L** |
|--------------------|------------|------------|------------|------------|------------|-------------|
| **ResNet50 + LSTM**| 0.648      | 0.451      | 0.300      | 0.202      | 0.421      | 0.506       |
| **ViT + BERT**     | 0.725      | **0.551**  | **0.395**  | **0.278**  | 0.501      | **0.546**   |
| **ViT + GPT2**     | **0.728**  | 0.545      | 0.385      | 0.265      | **0.502**  | 0.532       |
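One way to compute these metrics is with the Hugging Face `evaluate` library, sketched below; this is an illustration rather than the repository's evaluation script, and the example prediction and references are taken from the inference example further down.

```python
# A minimal sketch of computing BLEU, METEOR, and ROUGE-L with the Hugging
# Face `evaluate` library; prediction and references are illustrative.
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

predictions = ["a green traffic light hanging over a street"]
references = [[
    "Traffic is stopped at a red stop light.",
    "A traffic light and intersection with cars traveling in both directions on the street.",
]]

print(bleu.compute(predictions=predictions, references=references))   # BLEU-4 by default
print(meteor.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))  # includes rougeL
```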
|
|
|
---
|
|
|
## **Inference Example**
|
|
|
Below is an example of the models' output on a single image. The table shows the human-annotated reference captions alongside the caption predicted by each model; a sketch of the inference code follows the table.
|
|
|
<table>
<tr>
<th>Image</th>
<th>Reference Captions</th>
<th>Predicted Captions</th>
</tr>
<tr>
<td>
<img src="examples/000000166391.jpg" alt="Traffic light" width="300">
</td>
<td>
<ol>
<li>Traffic is stopped at a red stop light.</li>
<li>Cars are stopped at a traffic light on a highway.</li>
<li>A number of red and green traffic lights on a wide highway.</li>
<li>A large and wide street covered in lots of traffic lights.</li>
<li>A traffic light and intersection with cars traveling in both directions on the street.</li>
</ol>
</td>
<td>
<b>ResNet50 + LSTM:</b> a traffic light with a street sign on it.<br>
<b>ViT + BERT:</b> a bunch of traffic lights hanging from a wire.<br>
<b>ViT + GPT2:</b> A green traffic light hanging over a street.
</td>
</tr>
</table>
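Continuing the ViT + GPT2 sketch from the top of this README (and reusing the `model`, `processor`, and `tokenizer` names defined there), a single image can be captioned roughly as follows; the generation settings are illustrative, not the repository's exact ones.

```python
# A minimal sketch of single-image inference with the ViT + GPT2 model
# assembled earlier; max_length and num_beams are illustrative settings.
from PIL import Image

image = Image.open("examples/000000166391.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```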
|