Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
Abstract
ProxyV alleviates computational burdens in large multimodal models by using proxy vision tokens, enhancing efficiency without sacrificing performance.
Large multimodal models excel at multimodal tasks but face significant computational challenges due to the excessive computation spent on visual tokens. Unlike token reduction methods, which target token-level redundancy, we identify and study computation-level redundancy on vision tokens, ensuring that no visual information is lost. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We design a series of experiments to discover and progressively squeeze out this vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on the original vision tokens. ProxyV improves efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated by combining it with token reduction methods to further boost efficiency. The code will be made public at https://github.com/penghao-wu/ProxyV.
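The abstract only sketches the mechanism, so below is a minimal, hypothetical PyTorch sketch of the proxy-token idea as described: only the text tokens and a small set of proxy tokens pass through a decoder layer's heavy self-attention and FFN, while the full vision token sequence receives a lightweight, proxy-guided update so no visual information is discarded. All module names, shapes, and the pooling/cross-attention choices here are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a proxy-vision-token decoder layer (not the paper's code).
# Heavy computation runs only on [proxy tokens + text tokens]; the full set of
# vision tokens is updated by a cheap, proxy-guided module, so no tokens are dropped.
import torch
import torch.nn as nn


class ProxyVisionLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, num_proxy: int = 16):
        super().__init__()
        self.num_proxy = num_proxy
        # Heavy path: standard self-attention + FFN over the short proxy+text sequence.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Light path: vision tokens attend only to the updated proxies, then a small FFN;
        # this is far cheaper than full self-attention over all vision tokens.
        self.vision_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, vision: torch.Tensor, text: torch.Tensor):
        # Derive proxy tokens by average-pooling groups of vision tokens
        # (one simple choice; requires num vision tokens divisible by num_proxy).
        b, n, d = vision.shape
        proxies = vision.view(b, self.num_proxy, n // self.num_proxy, d).mean(dim=2)

        # Heavy computation on the short sequence of proxies + text tokens only.
        seq = torch.cat([proxies, text], dim=1)
        seq = seq + self.self_attn(seq, seq, seq, need_weights=False)[0]
        seq = seq + self.ffn(seq)
        proxies, text = seq[:, : self.num_proxy], seq[:, self.num_proxy:]

        # Lightweight update of the full vision token set, guided by the proxies.
        vision = vision + self.vision_cross_attn(vision, proxies, proxies, need_weights=False)[0]
        vision = vision + self.vision_ffn(vision)
        return vision, text


# Usage: 576 vision tokens, 32 text tokens, hidden size 1024.
layer = ProxyVisionLayer(dim=1024)
v, t = layer(torch.randn(2, 576, 1024), torch.randn(2, 32, 1024))
```

In this sketch the quadratic attention cost is paid only over the 16 proxies plus the text tokens rather than all 576 vision tokens, which is one concrete way to remove computation-level redundancy without reducing the token count.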
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA (2025)
- STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference (2025)
- Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping (2025)
- Window Token Concatenation for Efficient Visual Large Language Models (2025)
- Slow-Fast Architecture for Video Multi-Modal Large Language Models (2025)
- Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models (2025)
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs (2025)