vilt_finetuned_200
Model description
This is a fine-tuned version of the ViLT (Vision-and-Language Transformer) model, specifically the dandelin/vilt-b32-mlm checkpoint, for Visual Question Answering (VQA). It was fine-tuned on a subset of the Graphcore/vqa dataset: the first 200 samples of the validation split.
The model takes an image and a question as input and predicts an answer based on the image content. It uses a single transformer to jointly encode visual and textual information for effective understanding and reasoning.
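A minimal inference sketch is shown below. It assumes the fine-tuned weights and processor are both available in the KFrimps/vilt_finetuned_200 repository and that the model was saved with a ViltForQuestionAnswering head; if the processor files are missing from the repository, it may need to be loaded from the base dandelin/vilt-b32-mlm checkpoint instead. The image URL and question are placeholders.

```python
# Minimal inference sketch; the model/processor location and the example
# inputs are assumptions, not guaranteed by this card.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

model_id = "KFrimps/vilt_finetuned_200"
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForQuestionAnswering.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```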
Intended uses & limitations
Intended uses:
- Visual Question Answering: The primary use case is to answer questions about images. It can be used in applications like image captioning, visual dialogue systems, and knowledge extraction from images.
- Research and development: The model can serve as a baseline for further research and development in VQA and related tasks.
Limitations:
- Dataset size: The model was fine-tuned on a relatively small subset (200 samples) of the VQA dataset, which may limit its generalization ability to unseen images and questions.
- Bias: The model may exhibit biases present in the training data, potentially leading to inaccurate or unfair predictions in certain scenarios.
- Complex reasoning: While ViLT is designed for visual reasoning, it may struggle with questions requiring complex or abstract reasoning.
Training and evaluation data
Training data:
- Dataset: Graphcore/vqa
- Split: Validation
- Subset: First 200 samples

The training data consists of image-question-answer triplets, where the model learns to predict the answer based on the image and question.
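The subset can be reproduced with the datasets library; the slicing syntax below is a sketch and assumes Graphcore/vqa loads without extra configuration.

```python
# Sketch of reproducing the 200-sample subset used for fine-tuning.
from datasets import load_dataset

train_subset = load_dataset("Graphcore/vqa", split="validation[:200]")
print(train_subset)  # image/question/answer-style columns, 200 rows
```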
Evaluation data:
Currently, no evaluation dataset or metrics are reported for this fine-tuned model. It is recommended to evaluate the model on a separate VQA benchmark to assess its generalization ability and to compare it against other models.
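As a starting point, a simple exact-match check can be run over a few held-out examples. The image paths, questions, and expected answers below are placeholders, and exact match is a cruder metric than the standard VQA accuracy score.

```python
# Hypothetical evaluation sketch: exact-match accuracy over a handful of
# locally stored image/question/answer triples. The file names and answers
# are placeholders; no benchmark results are reported for this model.
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

model_id = "KFrimps/vilt_finetuned_200"
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForQuestionAnswering.from_pretrained(model_id).eval()

samples = [
    ("cat.jpg", "What animal is this?", "cat"),
    ("kitchen.jpg", "Is the light on?", "yes"),
]

correct = 0
with torch.no_grad():
    for path, question, expected in samples:
        inputs = processor(Image.open(path), question, return_tensors="pt")
        predicted = model.config.id2label[model(**inputs).logits.argmax(-1).item()]
        correct += int(predicted.lower() == expected.lower())

print(f"Exact-match accuracy: {correct / len(samples):.2f}")
```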
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 20
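These values map onto the Transformers TrainingArguments roughly as follows; the output directory name is illustrative, and the data collator and model setup are not covered here.

```python
# Sketch of TrainingArguments matching the hyperparameters above;
# output_dir is illustrative.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vilt_finetuned_200",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    num_train_epochs=20,
)
```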
Framework versions
- Transformers 4.50.0
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1