arxiv:2403.16998

Understanding Long Videos with Multimodal Language Models

Published on Mar 25, 2024

Authors:

Abstract

Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this performance. Surprisingly, we discover that LLM-based approaches can yield good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics domain tasks also establishes its generality. Code: https://github.com/kahnchana/mvu
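
To make the idea of using natural language as the fusion medium more concrete, here is a minimal sketch in Python. It is not the MVU implementation from the linked repository. The abstract does not name the three object-centric modalities, so the `ObjectTrack` record, its fields (object name plus per-frame positions), and the helpers `describe_track` and `build_prompt` are all hypothetical placeholders. The sketch only shows how per-object outputs from vision tools might be rendered as text and joined with a multiple-choice question before being passed to an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectTrack:
    # Hypothetical record for one object reported by an off-the-shelf vision tool.
    name: str
    # (frame_index, x_center, y_center), with coordinates normalized to [0, 1].
    positions: list[tuple[int, float, float]] = field(default_factory=list)


def describe_track(track: ObjectTrack) -> str:
    """Render one object's first and last observation as a plain-text line."""
    if not track.positions:
        return f"- {track.name}: seen, location unknown"
    first, last = track.positions[0], track.positions[-1]
    return (
        f"- {track.name}: at ({first[1]:.2f}, {first[2]:.2f}) in frame {first[0]}, "
        f"at ({last[1]:.2f}, {last[2]:.2f}) in frame {last[0]}"
    )


def build_prompt(question: str, options: list[str], tracks: list[ObjectTrack]) -> str:
    """Fuse object descriptions and a multiple-choice question into one text prompt."""
    lines = ["Objects observed in the video:"]
    # With no tracks, the prompt carries no video-specific information at all.
    lines += [describe_track(t) for t in tracks] or ["- (no video information available)"]
    lines.append(f"Question: {question}")
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```

Calling `build_prompt` with an empty `tracks` list gives a prompt with no video-specific content, which loosely mirrors the no-video setting mentioned in the abstract. Passing populated tracks adds one text line per object.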

