Multimodal Long Video Modeling Based on Temporal Dynamic Context
Abstract
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long videos due to the context-length constraints of LLMs and the vast amount of information such videos contain. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long-video encoding method that exploits the temporal relationships between frames, named Temporal Dynamic Context (TDC). First, we segment the video into semantically consistent scenes based on inter-frame similarities and encode each frame into tokens using visual-audio encoders. Second, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
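The two core steps of the abstract, similarity-based scene segmentation and query-based compression of each segment's multimodal tokens, can be illustrated with a minimal sketch. The names below (`segment_by_similarity`, `TemporalContextCompressor`), the similarity threshold, and the token dimensions are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the TDC pipeline (not the authors' code).
# Assumed/hypothetical: function and class names, threshold, dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_by_similarity(frame_feats: torch.Tensor, threshold: float = 0.85):
    """Split a video into scenes by thresholding the cosine similarity
    between consecutive frame features. frame_feats: (T, D)."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    boundaries = (sims < threshold).nonzero(as_tuple=True)[0] + 1
    cuts = [0, *boundaries.tolist(), frame_feats.size(0)]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]


class TemporalContextCompressor(nn.Module):
    """Query-based Transformer block that aggregates a segment's tokens
    into a fixed set of temporal context tokens via cross-attention."""

    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, segment_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens: (1, N, D) -- concatenated video/audio/text tokens
        # of one segment (only visual tokens are shown in the usage below).
        q = self.queries.unsqueeze(0)
        ctx, _ = self.cross_attn(q, segment_tokens, segment_tokens)
        return self.ffn(ctx)  # (1, num_queries, D) temporal context tokens


# Usage: segment the video, then compress each scene's tokens.
frames = torch.randn(128, 1024)                      # per-frame features (T, D)
compressor = TemporalContextCompressor()
segments = segment_by_similarity(frames)
context = [compressor(frames[s:e].unsqueeze(0)) for s, e in segments]
```

In the full model, the static frame tokens of each scene are kept alongside these compressed temporal context tokens, and both are passed to the LLM.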
Community
- We propose a framework for multimodal video modeling, which represents videos using both static visual features and dynamic multimodal context, effectively integrating visual and audio information within a unified video context.
- We introduce the Long Video Chain-of-Thought (LVCoT), a training-free strategy that enables MLLMs to process and reason over long videos step by step, enhancing the performance of existing models (a minimal sketch follows this list).
- We conduct extensive experiments with MLLMs of various sizes and evaluate them on multiple benchmarks, including general video question answering, long video understanding, and audio-visual video comprehension. Our models achieve strong performance, advancing the field of multimodal long video understanding.
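As referenced in the second bullet, LVCoT is training-free: the long video is split into segments, the MLLM is queried on each, and the intermediate answers are fed back as a reasoning chain when producing the final answer. The sketch below assumes a generic `mllm_answer(video, prompt)` callable and illustrative prompts; it is not the released code.

```python
# Hedged sketch of the LVCoT strategy (assumptions: mllm_answer, prompts).
from typing import Callable, List, Sequence


def lvcot_answer(
    video_segments: Sequence,          # pre-split chunks of a very long video
    question: str,
    mllm_answer: Callable[[object, str], str],
) -> str:
    """Progressively extract intermediate answers from each segment, then
    combine them as a reasoning chain to produce the final answer."""
    intermediate: List[str] = []
    for i, segment in enumerate(video_segments):
        prompt = (
            f"Question: {question}\n"
            "Describe any evidence in this part of the video that helps answer it."
        )
        intermediate.append(f"Segment {i + 1}: {mllm_answer(segment, prompt)}")

    reasoning = "\n".join(intermediate)
    final_prompt = (
        f"Question: {question}\n"
        f"Step-by-step observations from the video:\n{reasoning}\n"
        "Based on these observations, give the final answer."
    )
    # The final call can reuse the last segment or a downsampled full video.
    return mllm_answer(video_segments[-1], final_prompt)
```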
The following papers were recommended by the Semantic Scholar API:
- Token-Efficient Long Video Understanding for Multimodal LLMs (2025)
- Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding (2025)
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding (2025)
- Improving LLM Video Understanding with 16 Frames Per Second (2025)
- Breaking the Encoder Barrier for Seamless Video-Language Understanding (2025)
- VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding (2025)
- M-LLM Based Video Frame Selection for Efficient Video Understanding (2025)