---
license: cc-by-nc-4.0
tags:
- visual-question-answering
- multimodal
- pytorch
- cross-attention
- vision-transformer
pipeline_tag: visual-question-answering
---

# Visual Question Answering (VQA) Model

This is a multimodal Visual Question Answering system built as my Bachelor's final project. It combines a Vision Transformer (ViT) image encoder with a SmolLM2 language model through a cross-attention mechanism.

## Model Architecture

- **Vision Encoder:** Pretrained ViT
- **Language Model:** SmolLM2-135M
- **Fusion:** Cross-attention layer aligning vision and language features (see the sketch at the end of this card)
- **Dataset:** VQA v2 and LLaVA datasets, used for training

## How to Use

```python
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
from PIL import Image

# trust_remote_code=True may be required if the repository ships a custom architecture
processor = AutoProcessor.from_pretrained("mehmetkuzucu/Waffle-v1.0")
model = AutoModelForVisualQuestionAnswering.from_pretrained("mehmetkuzucu/Waffle-v1.0")

image = Image.open("example.jpg")
question = "What is the person doing?"

inputs = processor(images=image, text=question, return_tensors="pt")
outputs = model(**inputs)

# Greedy decode of the predicted token ids for the first item in the batch
answer = processor.tokenizer.decode(outputs.logits.argmax(-1)[0], skip_special_tokens=True)
print(answer)
```
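
## Fusion Mechanism (Sketch)

The following is a minimal sketch of the cross-attention fusion idea described above, not the repository's actual implementation. The module name `CrossAttentionFusion` and the dimensions are assumptions (SmolLM2-135M uses a 576-dim hidden state; a ViT-base encoder outputs 768-dim patch features), so a projection layer is included to align them.

```python
# Illustrative sketch: text tokens (queries) attend over projected image patch features.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, text_dim: int = 576, vision_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, text_dim)  # align vision features to text width
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden: torch.Tensor, vision_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden:   (batch, text_len, text_dim)      from the language model
        # vision_hidden: (batch, num_patches, vision_dim) from the ViT encoder
        vision = self.vision_proj(vision_hidden)
        attended, _ = self.cross_attn(query=text_hidden, key=vision, value=vision)
        return self.norm(text_hidden + attended)  # residual connection + layer norm


# Toy usage with random features
fusion = CrossAttentionFusion()
text = torch.randn(1, 16, 576)     # 16 question tokens
vision = torch.randn(1, 197, 768)  # 196 patches + CLS token from a 224x224 ViT
fused = fusion(text, vision)
print(fused.shape)                 # torch.Size([1, 16, 576])
```

In this sketch the fused text states would then be fed back into the language model's decoder stack so the generated answer is conditioned on the image.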