MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
Abstract
MotionSight, a zero-shot method using object-centric visual spotlight and motion blur as prompts, enhances fine-grained video motion understanding and achieves state-of-the-art performance on MotionVid-QA, a large-scale dataset with hierarchical annotations.
Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential for static images, its application to video's temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether MLLMs' inherent capabilities can be unlocked to boost motion perception, using visual prompts that produce distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method that pioneers object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into a valuable data asset, we curate MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, Θ(40K) video clips and Θ(87K) QAs. Experiments show that MotionSight achieves state-of-the-art open-source performance and is competitive with commercial models. In particular, for fine-grained motion understanding, we present a novel zero-shot technique and a large-scale, high-quality dataset. All code and annotations will be made publicly available.
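The abstract names two visual prompts: an object-centric spotlight that dims everything outside a region of interest, and a motion-blur composite that makes motion visually explicit. The sketch below is a minimal illustration of those two ideas, not the authors' implementation; the function names, the dimming factor, the averaging window, and the per-frame bounding boxes (from any off-the-shelf detector or tracker) are all assumptions for the example.

```python
# Minimal sketch of spotlight + motion-blur visual prompts (hypothetical helpers,
# not the MotionSight codebase). Frames are HxWx3 uint8 numpy arrays.
import numpy as np


def spotlight(frame: np.ndarray, box: tuple[int, int, int, int],
              dim: float = 0.35) -> np.ndarray:
    """Darken everything outside the (x1, y1, x2, y2) box so the object stands out."""
    x1, y1, x2, y2 = box
    out = (frame.astype(np.float32) * dim).astype(np.uint8)  # dimmed background
    out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]                  # object kept at full brightness
    return out


def motion_blur_composite(frames: list[np.ndarray]) -> np.ndarray:
    """Average a short window of consecutive frames so motion leaves visible streaks."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    return stack.mean(axis=0).astype(np.uint8)


def build_visual_prompts(clip: list[np.ndarray],
                         boxes: list[tuple[int, int, int, int]],
                         window: int = 3):
    """Produce (spotlighted, motion-blurred) frame pairs to feed an MLLM alongside the question."""
    prompted = []
    for i, (frame, box) in enumerate(zip(clip, boxes)):
        lit = spotlight(frame, box)
        blurred = motion_blur_composite(clip[max(0, i - window):i + 1])
        prompted.append((lit, blurred))
    return prompted
```

Under these assumptions, the spotlighted frames bias the model toward a single object's motion while the blurred composites expose global (camera) motion, which is one plausible way to realize the object/camera decoupling the abstract describes.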
Community
MotionSight: A zero-shot method and dataset (MotionVid-QA) for fine-grained video motion understanding with MLLMs.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion (2025)
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation (2025)
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models (2025)
- ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding (2025)
- LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer (2025)
- SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding (2025)
- MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows (2025)