Text-Based Reasoning About Vector Graphics

🌐 Homepage • 📃 Paper • 🤗 Data (PVD-160k) • 🤗 Model (PVD-160k-Mistral-7b) • 💻 Code

We observe that current large multimodal models (LMMs) still struggle with seemingly straightforward reasoning tasks that require precise perception of low-level visual details, such as identifying spatial relations or solving simple mazes. In particular, this failure mode persists in question-answering tasks about vector graphics: images composed purely of 2D objects and shapes.

*(Teaser figure)*

To address this challenge, we propose the Visually Descriptive Language Model (VDLM), a visual reasoning framework that operates on intermediate text-based visual descriptions (SVG representations and a learned Primal Visual Description) that can be directly integrated into existing LLMs and LMMs. We demonstrate that VDLM outperforms state-of-the-art large multimodal models, such as GPT-4V, across various multimodal reasoning tasks involving vector graphics. See our paper for more details.

*(Overview figure)*
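To give a flavor of why text-based visual descriptions help, the sketch below converts a toy SVG into short natural-language statements about each primitive. This is a hypothetical illustration of the general idea only, written with Python's standard `xml.etree` parser; it is not the paper's actual Primal Visual Description format or pipeline, which first converts raster images to SVG.

```python
import xml.etree.ElementTree as ET

# Toy SVG containing two primitives; the real VDLM pipeline would first
# convert a raster image to SVG before producing descriptions.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <circle cx="40" cy="40" r="10" fill="red"/>
  <rect x="70" y="30" width="20" height="20" fill="blue"/>
</svg>"""

NS = "{http://www.w3.org/2000/svg}"

def describe(svg_text):
    """Turn SVG primitives into text descriptions an LLM can reason over.

    Hypothetical sketch: the actual Primal Visual Description format
    learned by VDLM differs from this ad-hoc output.
    """
    root = ET.fromstring(svg_text)
    lines = []
    for el in root:
        tag = el.tag.replace(NS, "")  # strip the SVG namespace prefix
        if tag == "circle":
            lines.append(
                f"circle: center=({el.get('cx')}, {el.get('cy')}), "
                f"radius={el.get('r')}, color={el.get('fill')}"
            )
        elif tag == "rect":
            lines.append(
                f"rectangle: top-left=({el.get('x')}, {el.get('y')}), "
                f"size={el.get('width')}x{el.get('height')}, "
                f"color={el.get('fill')}"
            )
    return lines

print("\n".join(describe(SVG)))
```

Once flattened into text like this, questions about spatial relations (e.g. "is the circle left of the rectangle?") become pure language reasoning over coordinates, which is the intuition behind feeding such descriptions to an LLM.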

Model size: 7.24B parameters (Safetensors, BF16)
