Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
Abstract
Contrastive Attention Refinement for Visual Enhancement (CARVE) is a training-free method that improves VLM performance by contrasting attention maps to extract task-relevant visual signals, mitigating the reasoning degradation caused by visual complexity.
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. Existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse granularity, overlooking the innate abilities of VLMs themselves. To bridge this gap, we investigate VLMs' attention patterns and find that: (1) visual complexity strongly correlates with attention entropy, which negatively impacts reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with the degree of convergence determined by visual complexity; and (3) theoretically, we prove that contrasting the attention maps induced by a general query and a task-specific query decomposes the visual signal into semantic signal and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to a 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning through attention contrasting.
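The abstract describes the mechanism only at a high level, so the sketch below is an illustrative reading rather than the paper's implementation. It assumes we already have patch-level attention maps for a general query and a task-specific query, contrasts them with a log-ratio, and uses the positive residual as a pixel-level enhancement mask; the function names, the log-ratio form, and the masking scheme are assumptions.

```python
import numpy as np

def contrastive_attention_mask(attn_general: np.ndarray,
                               attn_task: np.ndarray,
                               eps: float = 1e-8) -> np.ndarray:
    """Contrast a task-specific attention map against a general-query map.

    Both inputs are (H, W) attention maps over image patches, e.g. averaged
    over heads from a deep decoder layer. Patches the task query attends to
    more strongly than the general query are kept as task-relevant signal;
    the rest is treated as query-agnostic visual noise.
    """
    g = attn_general / (attn_general.sum() + eps)   # normalize to distributions
    t = attn_task / (attn_task.sum() + eps)
    contrast = np.log(t + eps) - np.log(g + eps)    # log-ratio contrast
    contrast = np.clip(contrast, 0.0, None)         # keep only task-favoring patches
    if contrast.max() > 0:
        contrast /= contrast.max()                  # rescale to [0, 1]
    return contrast

def enhance_image(image: np.ndarray, mask: np.ndarray, floor: float = 0.3) -> np.ndarray:
    """Suppress pixels outside the mask (assumes image dims are multiples of the patch grid)."""
    h, w = image.shape[:2]
    ph, pw = mask.shape
    mask_px = np.repeat(np.repeat(mask, h // ph, axis=0), w // pw, axis=1)
    weights = floor + (1.0 - floor) * mask_px[..., None]
    return (image.astype(np.float32) * weights).astype(image.dtype)
```

In this reading, the masked image (or the mask itself) would be fed back to the VLM together with the original question; how CARVE actually consumes the contrasted attention is specified in the paper, not here.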
Community
The innate abilities of VLMs are powerful yet often overlooked. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning through attention contrasting.
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Variation-aware Vision Token Dropping for Faster Large Vision-Language Models (2025)
- $\Delta$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation (2025)
- A2R2: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement (2025)
- HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models (2025)
- Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment (2025)
- Simple o3: Towards Interleaved Vision-Language Reasoning (2025)
- MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention (2025)