VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
Abstract
VP-VLA is a dual-system framework that decouples high-level reasoning from low-level robotic control through structured visual prompting, improving spatial precision and robustness in vision-language-action tasks.
Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning from low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies the relevant target objects and goal locations. These spatial anchors are then overlaid directly onto visual observations as structured visual prompts, such as crosshairs and bounding boxes. Guided by these prompts and enhanced by a novel auxiliary visual grounding objective during training, a "System 1 Controller" reliably generates precise low-level motions. Experiments on the Robocasa-GR1-Tabletop benchmark and SimplerEnv simulation demonstrate that VP-VLA improves success rates by 5% and 8.3%, respectively, surpassing competitive baselines including QwenOFT and GR00T-N1.6.
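The planner-to-controller interface described above, overlaying spatial anchors as crosshairs and bounding boxes on the observation, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names, marker sizes, and colors are assumptions.

```python
# Illustrative sketch of VP-VLA-style structured visual prompting:
# render a crosshair at a target point and a bounding box around a
# goal region directly onto an RGB observation array.
import numpy as np

def draw_crosshair(img, cx, cy, size=6, color=(255, 0, 0)):
    """Overlay a crosshair marker centered at pixel (cx, cy)."""
    h, w, _ = img.shape
    for d in range(-size, size + 1):
        if 0 <= cy + d < h:
            img[cy + d, cx] = color  # vertical arm
        if 0 <= cx + d < w:
            img[cy, cx + d] = color  # horizontal arm
    return img

def draw_bbox(img, x0, y0, x1, y1, color=(0, 255, 0)):
    """Overlay a one-pixel bounding-box outline on [x0, x1] x [y0, y1]."""
    img[y0, x0:x1 + 1] = color  # top edge
    img[y1, x0:x1 + 1] = color  # bottom edge
    img[y0:y1 + 1, x0] = color  # left edge
    img[y0:y1 + 1, x1] = color  # right edge
    return img

# Dummy 128x128 observation; in the framework, the System 2 Planner
# would supply these coordinates and the prompted image would be fed
# to the System 1 Controller.
obs = np.zeros((128, 128, 3), dtype=np.uint8)
obs = draw_crosshair(obs, 40, 50)        # target-object anchor
obs = draw_bbox(obs, 70, 70, 110, 100)   # goal-location anchor
```

The key design point is that the anchors live in pixel space of the controller's own input, so no separate coordinate channel or text-based grounding is needed.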
Community
The following papers were recommended by the Semantic Scholar API
- ST4VLA: Spatially Guided Training for Vision-Language-Action Models (2026)
- Scaling World Model for Hierarchical Manipulation Policies (2026)
- ForeAct: Steering Your VLA with Efficient Visual Foresight Planning (2026)
- ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation (2026)
- Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models (2026)
- Language-Grounded Decoupled Action Representation for Robotic Manipulation (2026)
- HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning (2026)