arxiv:2511.02778

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Published on Nov 4 · Submitted by Qinghong (Kevin) Lin on Nov 5
#1 Paper of the day
Abstract

AI-generated summary

VCode is a benchmark for generating SVG code from images that preserves symbolic meaning; the paper highlights gaps in visual-centric coding and proposes VCoder to improve performance.

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
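To make the CodeVQA protocol concrete, here is a minimal sketch of the render-then-ask loop, assuming cairosvg for rasterization and a generic `policy_model` callable standing in for the judge VLM (both are illustrative placeholders, not the paper's actual implementation):

```python
# Hypothetical sketch of CodeVQA: render a model-generated SVG and let a
# policy VLM answer questions over the rendering. `policy_model` is a
# placeholder interface, not the paper's actual API.
import cairosvg

def code_vqa_score(svg_code: str, qa_pairs, policy_model) -> float:
    """Fraction of questions answered correctly from the rendered SVG,
    used as a proxy for how faithfully it preserves symbolic meaning."""
    png_bytes = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"))
    correct = 0
    for question, answer in qa_pairs:
        prediction = policy_model(image=png_bytes, question=question)
        correct += int(prediction.strip().lower() == answer.strip().lower())
    return correct / len(qa_pairs)
```

The Thinking-with-Revision axis of VCoder can likewise be pictured as an iterative render-compare-refine loop; the sketch below again uses placeholder method names (`generate_svg`, `analyze_discrepancies`, `refine_svg`) rather than the released code:

```python
def revise_svg(target_image, vlm, render, max_rounds: int = 3) -> str:
    """Iteratively refine SVG code by asking the VLM to compare its own
    rendering against the target image (all interfaces are placeholders)."""
    svg = vlm.generate_svg(target_image)
    for _ in range(max_rounds):
        rendering = render(svg)
        critique = vlm.analyze_discrepancies(target=target_image,
                                             rendering=rendering)
        if not critique:  # no discrepancies found; stop early
            break
        svg = vlm.refine_svg(svg, critique)
    return svg
```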

Community

Paper author · Paper submitter

TL;DR: SVG code as Symbolic Visual Representation
Project Page: https://csu-jpg.github.io/VCode/
GitHub: https://github.com/CSU-JPG/VCode

(Teaser figure)

It's pretty crazy that we can generate extremely high-quality, high-resolution 3D meshes with models like Hunyuan3D, yet struggle to generate simple, representative SVGs. It shouldn't be this way; something is wrong.
Of course, Hunyuan3D is a diffusion model trained on a corpus of millions of 3D meshes, and LLMs evidently were not trained on SVG generation to anywhere near the same extent. Still, the gap in performance is too big.
