Learning to Inference Adaptively for Multimodal Large Language Models
Abstract
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent efforts on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to the input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both the latency budget and the input content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.
Community
Project: https://zhuoyan-xu.github.io/ada-llava/
Paper: https://arxiv.org/abs/2503.10905
Code: https://github.com/zhuoyan-xu/AdaLLaVA
We present AdaLLaVA, a learning-based framework for adaptive inference in MLLMs. Given an input image, a text query, and a latency budget, AdaLLaVA enables an MLLM to answer the query about the image while adhering to the specified budget, a capability the base MLLM does not offer.
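To make the idea concrete, below is a minimal PyTorch sketch of a latency-conditioned scheduler in the spirit of what the abstract describes: a learned module that looks at the input and a latency budget, then decides which operations (here, transformer blocks) to execute. This is not the released implementation; all module names, shapes, and the simple thresholding gate are illustrative assumptions, and the actual code is in the repository linked above.

```python
# Illustrative sketch only -- not the AdaLLaVA implementation.
import torch
import torch.nn as nn


class LatencyScheduler(nn.Module):
    """Predicts per-block execution probabilities from pooled input features
    and a scalar latency budget (hypothetical design)."""

    def __init__(self, hidden_dim: int, num_blocks: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + 1, hidden_dim),  # +1 for the latency budget
            nn.GELU(),
            nn.Linear(hidden_dim, num_blocks),
        )

    def forward(self, features: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        # features: (batch, hidden_dim) pooled multimodal features
        # budget:   (batch, 1) latency budget, e.g. a fraction of full-model latency
        logits = self.mlp(torch.cat([features, budget], dim=-1))
        return torch.sigmoid(logits)  # per-block keep probabilities


def adaptive_forward(blocks, hidden, scheduler, budget, threshold=0.5):
    """Run only the blocks the scheduler selects for the given budget."""
    pooled = hidden.mean(dim=1)            # (batch, hidden_dim)
    keep_prob = scheduler(pooled, budget)  # (batch, num_blocks)
    for i, block in enumerate(blocks):
        if keep_prob[:, i].mean() > threshold:  # simple batch-level gate
            hidden = block(hidden)
        # skipped blocks contribute no compute, reducing latency
    return hidden


# Toy usage with made-up dimensions (illustrative only):
blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
scheduler = LatencyScheduler(hidden_dim=64, num_blocks=4)
hidden = torch.randn(2, 16, 64)      # (batch, tokens, hidden_dim)
budget = torch.full((2, 1), 0.5)     # spend roughly half of the full-model latency
out = adaptive_forward(blocks, hidden, scheduler, budget)
```

Conditioning the scheduler on both the pooled input features and the budget is what lets a single model trade accuracy for latency at runtime, matching the adaptive behavior described in the abstract.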
Similar papers recommended by the Semantic Scholar API:
- Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models (2025)
- LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models (2025)
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models (2025)
- TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models (2025)
- Towards Fast, Memory-based and Data-Efficient Vision-Language Policy (2025)
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (2025)
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference (2025)