arxiv:2606.12412

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Published on Jun 10

Abstract

Vision-language models can improve grounding performance under aggressive token reduction by replacing irreversible visual-token pruning with recoverable routing that allows tokens to re-enter the processing pipeline at later stages.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: https://github.com/elmma/mllm-reroute/
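The difference between rank-and-remove and recoverable routing described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: `toy_block`, `toy_scores`, `route`, the dot-product scoring rule, and the `[8, 4, 2]` schedule are all hypothetical stand-ins chosen for illustration, and text tokens and the KV cache are left out entirely. The sketch assumes that deferred tokens keep their hidden states unchanged while they bypass a stage.

```python
import numpy as np


def toy_block(x, w):
    """Hypothetical stand-in for a decoder block (a residual update, not a real layer)."""
    return x + np.tanh(x @ w)


def toy_scores(hidden, query):
    """Hypothetical stand-in for an attention-score ranking rule."""
    return hidden @ query


def route(tokens, query, stage_weights, keep_per_stage, recoverable=True):
    """Process visual tokens stage by stage with keep_per_stage[i] active tokens.

    recoverable=False: rank-and-remove, so a dropped token never returns.
    recoverable=True: deferred tokens skip the stage unchanged (an assumption
    of this sketch) and are scored again at the next routing decision.
    Returns the final hidden states and the active token indices per stage.
    """
    hidden = tokens.copy()
    candidates = np.arange(len(tokens))  # tokens eligible for selection
    history = []
    for weights, k in zip(stage_weights, keep_per_stage):
        scores = toy_scores(hidden[candidates], query)
        active = candidates[np.argsort(-scores)[:k]]  # top-k candidates
        for w in weights:  # only active tokens pass through the blocks
            hidden[active] = toy_block(hidden[active], w)
        history.append(np.sort(active))
        if not recoverable:
            candidates = active  # the pool shrinks permanently
        # in recoverable mode the pool stays full, so deferred tokens re-enter
    return hidden, history


rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
tokens = rng.normal(size=(n_tokens, dim))
query = rng.normal(size=dim)
# Three stages of two toy blocks each.
stage_weights = [
    [rng.normal(scale=0.3, size=(dim, dim)) for _ in range(2)] for _ in range(3)
]
keep = [8, 4, 2]

_, pruned = route(tokens, query, stage_weights, keep, recoverable=False)
_, rerouted = route(tokens, query, stage_weights, keep, recoverable=True)
for stage, (p, r) in enumerate(zip(pruned, rerouted)):
    print(f"stage {stage}: removed-mode active={p.tolist()} routed-mode active={r.tolist()}")
```

Both modes process the same number of tokens in every stage, which is the sense in which this toy keeps the per-stage compute budget fixed. The only difference is whether the candidate pool shrinks after each decision or stays open to previously deferred tokens.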

Community

yulunliu

Paper submitter about 23 hours ago

noahml

about 2 hours ago

Neat paper. The idea that current token reduction methods are too aggressive by permanently discarding visual information makes a lot of sense, especially since token importance seems to shift as you go deeper into the decoder layers.

I'm curious how this impacts the overall inference latency in practice. Does the overhead of managing the routing and re-entry process add much to the compute time compared to just dropping the tokens entirely?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/cdbfb49f-8f42-44f3-b90b-1c9e2b501462