arxiv:2507.07990

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Published on Jul 10 · Submitted by js-hyun on Jul 11
Abstract

AI-generated summary
A spatio-temporal token merging method improves video LLM efficiency by exploiting redundancy, achieving significant speed-ups with minimal accuracy loss.

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data, which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
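For intuition, here is a minimal PyTorch sketch of the decomposed idea described in the abstract: coarse-to-fine merging over a quadtree within each frame, then directed merging of each frame's tokens against the previous frame. The function names, the cosine-similarity thresholds `tau_s` / `tau_t`, the mean-pooling merge operator, and the backward matching rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


def quadtree_merge(frame, tau_s=0.9):
    """Coarse-to-fine spatial merging over one frame's patch grid (H, W, D)."""
    H, W, D = frame.shape
    kept = []

    def visit(y0, y1, x0, x1):
        block = frame[y0:y1, x0:x1].reshape(-1, D)
        center = block.mean(dim=0, keepdim=True)
        sim = F.cosine_similarity(block, center, dim=-1)
        # Keep one merged token if the node is homogeneous enough (or a single patch);
        # otherwise split the node into four quadrants and recurse (finer granularity).
        if (y1 - y0) == 1 or sim.min() >= tau_s:
            kept.append(center.squeeze(0))
        else:
            my, mx = (y0 + y1) // 2, (x0 + x1) // 2
            for ys, ye, xs, xe in [(y0, my, x0, mx), (y0, my, mx, x1),
                                   (my, y1, x0, mx), (my, y1, mx, x1)]:
                visit(ys, ye, xs, xe)

    visit(0, H, 0, W)
    return torch.stack(kept)                      # (n_kept, D), multi-granular tokens


def temporal_merge(per_frame_tokens, tau_t=0.9):
    """Directed merging: drop a token if it is well explained by the previous frame."""
    reduced = [per_frame_tokens[0]]               # first frame is kept as-is
    for prev, cur in zip(per_frame_tokens[:-1], per_frame_tokens[1:]):
        sim = F.cosine_similarity(cur.unsqueeze(1), prev.unsqueeze(0), dim=-1)
        keep = sim.max(dim=1).values < tau_t      # keep only tokens not matched backwards
        reduced.append(cur[keep])
    return torch.cat(reduced)                     # reduced visual token sequence


# Toy usage: 8 frames of a 16x16 patch grid with 64-d features.
video = torch.randn(8, 16, 16, 64)
tokens = temporal_merge([quadtree_merge(f) for f in video])
print(tokens.shape)
```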

Community

Paper author · Paper submitter

Double the Speed, Zero Training: The Free Lunch for Video LLMs!
Long videos slow things down: the LLM must prefill a massive context before it can respond. We introduce STTM, the first training-free spatio-temporal token merging method for video LLMs. Even better, it's query-agnostic, so reduced KV caches can be reused across multiple questions for the same video.
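As a toy illustration of that query-agnostic point: because the reduced visual tokens do not depend on the question text, the expensive (quadratic) video prefill is paid once per video and its cache serves every question. `prefill` and `answer` below are stand-ins defined in the snippet itself, not an API from the paper or any library.

```python
def prefill(video_tokens):
    """Pretend prefill: attention cost grows quadratically with the token count."""
    return {"cache": list(video_tokens), "attn_cost": len(video_tokens) ** 2}

def answer(state, question):
    """Pretend decode: any question can run against the same cached video tokens."""
    return f"answer to {question!r} from {len(state['cache'])} cached video tokens"

full_tokens = list(range(10_000))      # original visual tokens
reduced_tokens = full_tokens[::2]      # 50% token budget (stand-in for STTM merging)

state = prefill(reduced_tokens)        # computed once, independent of any question
print(state["attn_cost"] / prefill(full_tokens)["attn_cost"])  # 0.25: quadratic scaling
for q in ["What happens first?", "Who appears at the end?"]:
    print(answer(state, q))            # the same cache is reused across questions
```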

πŸ” TL;DR
🧩 Merging mechanism. (1) Coarse-to-fine spatial token merging per frame. (2) Directed temporal merging of different granular spatial tokens across nearby frames .
🌐 Model generalization. Validated with LLaVA-Video-7B/72B, LLaVA-OneVision-7B, and Qwen2VL-7B
πŸ“Š Dataset coverage. Evaluated on 6 video QA datasets covering 3 categories:
    πŸ”Έ NIAH: VNBench
    πŸ”Έ Long: VideoMME, LongVideoBench, MLVU
    πŸ”Έ Short: EgoSchema, NExT-QA
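The 50% and 30% token budgets quoted in the abstract and in the results below can, in principle, be hit by tuning the similarity threshold of a merge routine. The binary-search loop below (reusing `quadtree_merge` and the imports from the earlier sketch; `calibrate_threshold` and `kept_at` are hypothetical names) only illustrates that idea and is not claimed to be how STTM sets its budget.

```python
def calibrate_threshold(kept_at, n_full, budget, lo=0.0, hi=1.0, iters=20):
    """Find tau such that kept_at(tau) keeps roughly budget * n_full tokens.

    kept_at(tau) must be (approximately) non-decreasing in tau: a stricter
    threshold merges less and therefore keeps more tokens.
    """
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        if kept_at(tau) / n_full > budget:
            hi = tau        # too many tokens survive: lower tau to merge more
        else:
            lo = tau        # too few survive: raise tau to keep more
    return (lo + hi) / 2.0


# Example: one 16x16 frame (256 patches), aiming for a 50% spatial budget.
frame = torch.randn(16, 16, 64)
tau_50 = calibrate_threshold(lambda t: len(quadtree_merge(frame, tau_s=t)), 256, budget=0.5)
print(tau_50, len(quadtree_merge(frame, tau_s=tau_50)) / 256)
```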

⚡ Results (accuracy and speed-up are relative to the same model run with the full token set)
🚀 LLaVA-Video-7B. (1) Under a 50% token budget, 2.1× speed-up with 99.5% accuracy. (2) Under a 30% budget, 3.0× speed-up with 97.8% accuracy.
🚀 LLaVA-OneVision-7B. (1) Under a 50% token budget, 2.2× speed-up with 102.1% accuracy. (2) Under a 30% budget, 3.1× speed-up with 101.1% accuracy.
🚀 Qwen2VL-7B. (1) Under a 50% token budget, 2.6× speed-up with 102.7% accuracy. (2) Under a 30% budget, 4.5× speed-up with 100.5% accuracy.
🚀 LLaVA-Video-72B. (1) Under a 50% token budget, 2.3× speed-up with 101.3% accuracy. (2) Under a 30% budget, 3.3× speed-up with 99.1% accuracy.
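To make the reporting convention explicit: "accuracy" above is the reduced-token score divided by the full-token score (which is why some entries exceed 100%), and "speed-up" is a latency ratio. The raw values below are made up purely to show the arithmetic.

```python
baseline_acc, reduced_acc = 63.0, 62.7     # hypothetical benchmark accuracies (%)
baseline_ms, reduced_ms = 1000.0, 476.0    # hypothetical inference latencies (ms)

relative_acc = 100 * reduced_acc / baseline_acc  # ~99.5% accuracy retained
speedup = baseline_ms / reduced_ms               # ~2.1x speed-up
print(f"{relative_acc:.1f}% accuracy, {speedup:.1f}x speed-up")
```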

