Multimodal Long Video Modeling Based on Temporal Dynamic Context
Abstract
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long videos due to the context-length constraints of LLMs and the vast amount of information such videos contain. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long-video encoding method that exploits the temporal relationships between frames, named Temporal Dynamic Context (TDC). First, we segment the video into semantically consistent scenes based on inter-frame similarities and encode each frame into tokens using visual-audio encoders. Second, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
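The two core steps of the abstract, similarity-based scene segmentation and query-based compression of each segment's multimodal tokens, can be illustrated with a minimal sketch. The names below (`segment_by_similarity`, `TemporalContextCompressor`), the similarity threshold, and the token dimensions are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the TDC pipeline (not the authors' code).
# Assumed/hypothetical: function and class names, threshold, dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_by_similarity(frame_feats: torch.Tensor, threshold: float = 0.85):
    """Split a video into scenes by thresholding the cosine similarity
    between consecutive frame features. frame_feats: (T, D)."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    boundaries = (sims < threshold).nonzero(as_tuple=True)[0] + 1
    cuts = [0, *boundaries.tolist(), frame_feats.size(0)]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]


class TemporalContextCompressor(nn.Module):
    """Query-based Transformer block that aggregates a segment's tokens
    into a fixed set of temporal context tokens via cross-attention."""

    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, segment_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens: (1, N, D) -- concatenated video/audio/text tokens
        # of one segment (only visual tokens are shown in the usage below).
        q = self.queries.unsqueeze(0)
        ctx, _ = self.cross_attn(q, segment_tokens, segment_tokens)
        return self.ffn(ctx)  # (1, num_queries, D) temporal context tokens


# Usage: segment the video, then compress each scene's tokens.
frames = torch.randn(128, 1024)                      # per-frame features (T, D)
compressor = TemporalContextCompressor()
segments = segment_by_similarity(frames)
context = [compressor(frames[s:e].unsqueeze(0)) for s, e in segments]
```

In the full model, the static frame tokens of each scene are kept alongside these compressed temporal context tokens, and both are passed to the LLM.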
Community
- We propose a framework for multimodal video modeling, which represents videos using both static visual features and dynamic multimodal context, effectively integrating visual and audio information within a unified video context.
- We introduce the Long Video Chain-of-Thought (LVCoT), a training-free strategy that enables MLLMs to process and reason over long videos step by step, enhancing the performance of existing models (a minimal sketch follows this list).
- We conduct extensive experiments with MLLMs of various sizes and evaluate them on multiple benchmarks, including general video question answering, long video understanding, and audio-visual video comprehension. Our models achieve strong performance, advancing the field of multimodal long video understanding.
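As referenced in the second bullet, LVCoT is training-free: the long video is split into segments, the MLLM is queried on each, and the intermediate answers are fed back as a reasoning chain when producing the final answer. The sketch below assumes a generic `mllm_answer(video, prompt)` callable and illustrative prompts; it is not the released code.

```python
# Hedged sketch of the LVCoT strategy (assumptions: mllm_answer, prompts).
from typing import Callable, List, Sequence


def lvcot_answer(
    video_segments: Sequence,          # pre-split chunks of a very long video
    question: str,
    mllm_answer: Callable[[object, str], str],
) -> str:
    """Progressively extract intermediate answers from each segment, then
    combine them as a reasoning chain to produce the final answer."""
    intermediate: List[str] = []
    for i, segment in enumerate(video_segments):
        prompt = (
            f"Question: {question}\n"
            "Describe any evidence in this part of the video that helps answer it."
        )
        intermediate.append(f"Segment {i + 1}: {mllm_answer(segment, prompt)}")

    reasoning = "\n".join(intermediate)
    final_prompt = (
        f"Question: {question}\n"
        f"Step-by-step observations from the video:\n{reasoning}\n"
        "Based on these observations, give the final answer."
    )
    # The final call can reuse the last segment or a downsampled full video.
    return mllm_answer(video_segments[-1], final_prompt)
```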
The following papers were recommended by the Semantic Scholar API:
- Token-Efficient Long Video Understanding for Multimodal LLMs (2025)
- Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding (2025)
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding (2025)
- Improving LLM Video Understanding with 16 Frames Per Second (2025)
- Breaking the Encoder Barrier for Seamless Video-Language Understanding (2025)
- VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding (2025)
- M-LLM Based Video Frame Selection for Efficient Video Understanding (2025)