---
license: cc-by-nc-4.0
tags:
  - visual-question-answering
  - multimodal
  - pytorch
  - cross-attention
  - vision-transformer
pipeline_tag: visual-question-answering
---

# Visual Question Answering (VQA) Model

This is a multimodal Visual Question Answering system built for my Bachelor's final project. It combines a Vision Transformer (ViT) image encoder with a SmolLM2 language model through a cross-attention fusion mechanism.

## Model Architecture

- Vision Encoder: Pretrained ViT
- Language Model: SmolLM2-135M
- Fusion: Cross-attention layer aligning vision and language (see the sketch after this list)
- Training Data: VQA v2 and LLaVA datasets
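
The fusion step can be pictured roughly as follows. This is a minimal, illustrative sketch, not the actual Waffle-v1.0 implementation: the module name and dimensions (576 for SmolLM2-135M, 768 for ViT-Base) are assumptions, and the idea is simply that the language-model hidden states act as queries while the ViT patch embeddings provide keys and values.

```python
import torch
import torch.nn as nn

class VisionLanguageCrossAttention(nn.Module):
    """Illustrative fusion block: text hidden states attend over image patch embeddings.

    Dimensions are assumptions for the sketch, not the released Waffle-v1.0 config.
    """

    def __init__(self, text_dim=576, vision_dim=768, num_heads=8):
        super().__init__()
        # Project ViT patch embeddings into the language model's hidden size.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, vision_patches):
        # text_hidden:    (batch, text_len, text_dim)      from SmolLM2
        # vision_patches: (batch, num_patches, vision_dim)  from the ViT encoder
        vis = self.vision_proj(vision_patches)
        attended, _ = self.cross_attn(query=text_hidden, key=vis, value=vis)
        # Residual connection keeps the original text representation intact.
        return self.norm(text_hidden + attended)


# Quick shape check with random tensors.
fusion = VisionLanguageCrossAttention()
text = torch.randn(1, 16, 576)      # e.g. an embedded question
patches = torch.randn(1, 197, 768)  # e.g. ViT-Base patches (196 patches + CLS)
print(fusion(text, patches).shape)  # torch.Size([1, 16, 576])
```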

## How to Use

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering

processor = AutoProcessor.from_pretrained("mehmetkuzucu/Waffle-v1.0")
model = AutoModelForVisualQuestionAnswering.from_pretrained("mehmetkuzucu/Waffle-v1.0")

image = Image.open("example.jpg").convert("RGB")
question = "What is the person doing?"

inputs = processor(images=image, text=question, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Greedy decode: take the highest-scoring token at each position.
answer = processor.tokenizer.decode(outputs.logits.argmax(-1)[0], skip_special_tokens=True)
print(answer)
```
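
Since the language side is a decoder-only SmolLM2, answers can also be produced autoregressively. The snippet below assumes the checkpoint exposes the standard `generate()` interface, which the card does not document for this custom fusion model, so treat it as a sketch:

```python
# Assumes the fused model supports generate(); adjust max_new_tokens as needed.
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(answer)
```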