---
license: cc-by-nc-4.0
tags:
- visual-question-answering
- multimodal
- pytorch
- cross-attention
- vision-transformer
pipeline_tag: visual-question-answering
---

# Visual Question Answering (VQA) Model

This is a multimodal Visual Question Answering system built as my Bachelor's final project. It combines a Vision Transformer (ViT) image encoder with a SmolLM2 language model through a cross-attention mechanism.

## Model Architecture

- **Vision Encoder:** Pretrained ViT
- **Language Model:** SmolLM2-135M
- **Fusion:** Cross-attention layer aligning vision and language features (see the sketch at the end of this card)
- **Dataset:** VQA v2 and LLaVA datasets, used for training

## How to Use

```python
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
from PIL import Image

# trust_remote_code=True may be required if the repository ships a custom architecture
processor = AutoProcessor.from_pretrained("mehmetkuzucu/Waffle-v1.0")
model = AutoModelForVisualQuestionAnswering.from_pretrained("mehmetkuzucu/Waffle-v1.0")

image = Image.open("example.jpg")
question = "What is the person doing?"

inputs = processor(images=image, text=question, return_tensors="pt")
outputs = model(**inputs)

# Greedy decode of the predicted token ids for the first item in the batch
answer = processor.tokenizer.decode(outputs.logits.argmax(-1)[0], skip_special_tokens=True)
print(answer)
```
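
## Fusion Mechanism (Sketch)

The following is a minimal sketch of the cross-attention fusion idea described above, not the repository's actual implementation. The module name `CrossAttentionFusion` and the dimensions are assumptions (SmolLM2-135M uses a 576-dim hidden state; a ViT-base encoder outputs 768-dim patch features), so a projection layer is included to align them.

```python
# Illustrative sketch: text tokens (queries) attend over projected image patch features.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, text_dim: int = 576, vision_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, text_dim)  # align vision features to text width
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden: torch.Tensor, vision_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden:   (batch, text_len, text_dim)      from the language model
        # vision_hidden: (batch, num_patches, vision_dim) from the ViT encoder
        vision = self.vision_proj(vision_hidden)
        attended, _ = self.cross_attn(query=text_hidden, key=vision, value=vision)
        return self.norm(text_hidden + attended)  # residual connection + layer norm


# Toy usage with random features
fusion = CrossAttentionFusion()
text = torch.randn(1, 16, 576)     # 16 question tokens
vision = torch.randn(1, 197, 768)  # 196 patches + CLS token from a 224x224 ViT
fused = fusion(text, vision)
print(fused.shape)                 # torch.Size([1, 16, 576])
```

In this sketch the fused text states would then be fed back into the language model's decoder stack so the generated answer is conditioned on the image.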