arxiv:2507.07990

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Published on Jul 10 · Submitted by js-hyun on Jul 11
Abstract

AI-generated summary
A spatio-temporal token merging method improves video LLM efficiency by exploiting redundancy, achieving significant speed-ups with minimal accuracy loss.

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data, which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
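For intuition, here is a minimal PyTorch sketch of the decomposed idea described in the abstract: coarse-to-fine merging over a quadtree within each frame, then directed merging of each frame's tokens against the previous frame. The function names, the cosine-similarity thresholds `tau_s` / `tau_t`, the mean-pooling merge operator, and the backward matching rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


def quadtree_merge(frame, tau_s=0.9):
    """Coarse-to-fine spatial merging over one frame's patch grid (H, W, D)."""
    H, W, D = frame.shape
    kept = []

    def visit(y0, y1, x0, x1):
        block = frame[y0:y1, x0:x1].reshape(-1, D)
        center = block.mean(dim=0, keepdim=True)
        sim = F.cosine_similarity(block, center, dim=-1)
        # Keep one merged token if the node is homogeneous enough (or a single patch);
        # otherwise split the node into four quadrants and recurse (finer granularity).
        if (y1 - y0) == 1 or sim.min() >= tau_s:
            kept.append(center.squeeze(0))
        else:
            my, mx = (y0 + y1) // 2, (x0 + x1) // 2
            for ys, ye, xs, xe in [(y0, my, x0, mx), (y0, my, mx, x1),
                                   (my, y1, x0, mx), (my, y1, mx, x1)]:
                visit(ys, ye, xs, xe)

    visit(0, H, 0, W)
    return torch.stack(kept)                      # (n_kept, D), multi-granular tokens


def temporal_merge(per_frame_tokens, tau_t=0.9):
    """Directed merging: drop a token if it is well explained by the previous frame."""
    reduced = [per_frame_tokens[0]]               # first frame is kept as-is
    for prev, cur in zip(per_frame_tokens[:-1], per_frame_tokens[1:]):
        sim = F.cosine_similarity(cur.unsqueeze(1), prev.unsqueeze(0), dim=-1)
        keep = sim.max(dim=1).values < tau_t      # keep only tokens not matched backwards
        reduced.append(cur[keep])
    return torch.cat(reduced)                     # reduced visual token sequence


# Toy usage: 8 frames of a 16x16 patch grid with 64-d features.
video = torch.randn(8, 16, 16, 64)
tokens = temporal_merge([quadtree_merge(f) for f in video])
print(tokens.shape)
```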

Community

Paper author · Paper submitter

Double the Speed, Zero Training: The Free Lunch for Video LLMs!
Long videos slow things down: the LLM must prefill a massive context before it can respond. We introduce STTM, the first training-free spatio-temporal token merging method for video LLMs. Even better, it's query-agnostic, so reduced KV caches can be reused across multiple questions for the same video.
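As a toy illustration of that query-agnostic point: because the reduced visual tokens do not depend on the question text, the expensive (quadratic) video prefill is paid once per video and its cache serves every question. `prefill` and `answer` below are stand-ins defined in the snippet itself, not an API from the paper or any library.

```python
def prefill(video_tokens):
    """Pretend prefill: attention cost grows quadratically with the token count."""
    return {"cache": list(video_tokens), "attn_cost": len(video_tokens) ** 2}

def answer(state, question):
    """Pretend decode: any question can run against the same cached video tokens."""
    return f"answer to {question!r} from {len(state['cache'])} cached video tokens"

full_tokens = list(range(10_000))      # original visual tokens
reduced_tokens = full_tokens[::2]      # 50% token budget (stand-in for STTM merging)

state = prefill(reduced_tokens)        # computed once, independent of any question
print(state["attn_cost"] / prefill(full_tokens)["attn_cost"])  # 0.25: quadratic scaling
for q in ["What happens first?", "Who appears at the end?"]:
    print(answer(state, q))            # the same cache is reused across questions
```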

πŸ” TL;DR
🧩 Merging mechanism. (1) Coarse-to-fine spatial token merging per frame. (2) Directed temporal merging of different granular spatial tokens across nearby frames .
🌐 Model generalization. Validated with LLaVA-Video-7B/72B, LLaVA-OneVision-7B, and Qwen2VL-7B
πŸ“Š Dataset coverage. Evaluated on 6 video QA datasets covering 3 categories:
    πŸ”Έ NIAH: VNBench
    πŸ”Έ Long: VideoMME, LongVideoBench, MLVU
    πŸ”Έ Short: EgoSchema, NExT-QA
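The 50% and 30% token budgets quoted in the abstract and in the results below can, in principle, be hit by tuning the similarity threshold of a merge routine. The binary-search loop below (reusing `quadtree_merge` and the imports from the earlier sketch; `calibrate_threshold` and `kept_at` are hypothetical names) only illustrates that idea and is not claimed to be how STTM sets its budget.

```python
def calibrate_threshold(kept_at, n_full, budget, lo=0.0, hi=1.0, iters=20):
    """Find tau such that kept_at(tau) keeps roughly budget * n_full tokens.

    kept_at(tau) must be (approximately) non-decreasing in tau: a stricter
    threshold merges less and therefore keeps more tokens.
    """
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        if kept_at(tau) / n_full > budget:
            hi = tau        # too many tokens survive: lower tau to merge more
        else:
            lo = tau        # too few survive: raise tau to keep more
    return (lo + hi) / 2.0


# Example: one 16x16 frame (256 patches), aiming for a 50% spatial budget.
frame = torch.randn(16, 16, 64)
tau_50 = calibrate_threshold(lambda t: len(quadtree_merge(frame, tau_s=t)), 256, budget=0.5)
print(tau_50, len(quadtree_merge(frame, tau_s=tau_50)) / 256)
```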

⚡ Results (accuracy and speed-up are relative to the same model run with the full token set)
🚀 LLaVA-Video-7B. (1) Under a 50% token budget, 2.1× speed-up with 99.5% accuracy. (2) Under a 30% budget, 3.0× speed-up with 97.8% accuracy.
🚀 LLaVA-OneVision-7B. (1) Under a 50% token budget, 2.2× speed-up with 102.1% accuracy. (2) Under a 30% budget, 3.1× speed-up with 101.1% accuracy.
🚀 Qwen2VL-7B. (1) Under a 50% token budget, 2.6× speed-up with 102.7% accuracy. (2) Under a 30% budget, 4.5× speed-up with 100.5% accuracy.
🚀 LLaVA-Video-72B. (1) Under a 50% token budget, 2.3× speed-up with 101.3% accuracy. (2) Under a 30% budget, 3.3× speed-up with 99.1% accuracy.
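To make the reporting convention explicit: "accuracy" above is the reduced-token score divided by the full-token score (which is why some entries exceed 100%), and "speed-up" is a latency ratio. The raw values below are made up purely to show the arithmetic.

```python
baseline_acc, reduced_acc = 63.0, 62.7     # hypothetical benchmark accuracies (%)
baseline_ms, reduced_ms = 1000.0, 476.0    # hypothetical inference latencies (ms)

relative_acc = 100 * reduced_acc / baseline_acc  # ~99.5% accuracy retained
speedup = baseline_ms / reduced_ms               # ~2.1x speed-up
print(f"{relative_acc:.1f}% accuracy, {speedup:.1f}x speed-up")
```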

