vilt_finetuned_200

Model description

This is a fine-tuned version of the ViLT (Vision-and-Language Transformer) model, specifically the dandelin/vilt-b32-mlm checkpoint, adapted for Visual Question Answering (VQA). It was trained on a small subset of the Graphcore/vqa dataset, namely the first 200 samples of the validation split.

The model takes an image and a question as input and predicts the answer based on the image content. It leverages a transformer architecture to combine visual and textual information for effective understanding and reasoning.
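A minimal usage sketch is shown below (it assumes the processor configuration was saved alongside the model weights in this repository; the example image URL and question are purely illustrative):

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Assumption: the processor files live in the same repository as the weights.
processor = ViltProcessor.from_pretrained("KFrimps/vilt_finetuned_200")
model = ViltForQuestionAnswering.from_pretrained("KFrimps/vilt_finetuned_200")

# Illustrative image and question; substitute your own inputs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
predicted_idx = logits.argmax(-1).item()
print(model.config.id2label[predicted_idx])
```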

Intended uses & limitations

Intended uses:

  • Visual Question Answering: The primary use case is to answer questions about images. It can be used in applications like image captioning, visual dialogue systems, and knowledge extraction from images.
  • Research and development: The model can serve as a baseline for further research and development in VQA and related tasks.

Limitations:

  • Dataset size: The model was fine-tuned on a relatively small subset (200 samples) of the VQA dataset, which may limit its generalization ability to unseen images and questions.
  • Bias: The model may exhibit biases present in the training data, potentially leading to inaccurate or unfair predictions in certain scenarios.
  • Complex reasoning: While ViLT is designed for visual reasoning, it may struggle with questions requiring complex or abstract reasoning.

Training and evaluation data

Training data:

  • Dataset: Graphcore/vqa
  • Split: Validation
  • Subset: First 200 samples

The training data consists of image-question-answer triplets, from which the model learns to predict the answer given the image and question (see the loading sketch below).
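A minimal sketch of selecting that subset with the datasets library (preprocessing and field handling are omitted and may differ from the actual training script):

```python
from datasets import load_dataset

# First 200 samples of the validation split, as described above.
train_subset = load_dataset("Graphcore/vqa", split="validation[:200]")
print(train_subset)
```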

Evaluation data:

Currently, there is no specific evaluation dataset or metrics reported for this fine-tuned model. It is recommended to evaluate the model's performance on a separate VQA benchmark dataset to assess its generalization ability and compare it to other models.
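For example, a simple exact-match accuracy check over held-out samples could look like the following sketch (the (image, question, answer) triplet format and the string normalization are assumptions, not part of any released evaluation code for this model):

```python
import torch

def exact_match_accuracy(model, processor, samples):
    """samples: iterable of (PIL image, question string, answer string) triplets."""
    correct, total = 0, 0
    model.eval()
    with torch.no_grad():
        for image, question, answer in samples:
            inputs = processor(image, question, return_tensors="pt")
            logits = model(**inputs).logits
            prediction = model.config.id2label[logits.argmax(-1).item()]
            correct += int(prediction.strip().lower() == answer.strip().lower())
            total += 1
    return correct / total
```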

Training procedure

Training hyperparameters

The following hyperparameters were used during training (they are also shown as a TrainingArguments sketch after the list):

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 20
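The same settings expressed as a Transformers TrainingArguments sketch (output_dir is a placeholder; any arguments not listed above are left at their defaults):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vilt_finetuned_200",   # placeholder output path
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",               # AdamW, betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=20,
)
```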

Framework versions

  • Transformers 4.50.0
  • Pytorch 2.6.0+cu124
  • Datasets 3.5.0
  • Tokenizers 0.21.1