Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
Abstract
Sparse-vDiT accelerates Video Diffusion Transformers (vDiTs) by exploiting recurring sparsity patterns in their attention maps, reducing theoretical FLOPs and improving inference speed without significant loss in visual quality.
While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long-sequence generation task remains constrained by the quadratic complexity of the attention mechanism, resulting in significant inference latency. Through detailed analysis of attention maps in the Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures; moreover, 3–6% of attention heads can be skipped entirely. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern, and 2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, further improving inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reductions and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
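To make the idea concrete, the sketch below illustrates, under our own assumptions, how the three sparsity patterns could be expressed as attention masks and how a simple coverage-versus-cost criterion might pick a pattern per head. The function names, window sizes, and the `keep_thresh` threshold are illustrative, not taken from the paper; the actual method uses pattern-optimized sparse kernels and a hardware-aware offline search rather than runtime boolean masking.

```python
# Minimal sketch (not the authors' kernels) of the diagonal, multi-diagonal,
# and vertical-stripe sparsity patterns as boolean masks, plus a toy offline
# selection step. All names, window sizes, and thresholds are assumptions.
import torch

def diagonal_mask(n: int, window: int = 64) -> torch.Tensor:
    """Keep only query-key pairs within a local window around the diagonal."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= window

def multi_diagonal_mask(n: int, period: int = 256, window: int = 32) -> torch.Tensor:
    """Keep several parallel diagonals, e.g. the same spatial position in
    neighboring frames (offsets that are near multiples of `period`)."""
    offset = (torch.arange(n)[:, None] - torch.arange(n)[None, :]).abs()
    return (offset % period) <= window

def vertical_stripe_mask(n: int, stripe_every: int = 128, width: int = 8) -> torch.Tensor:
    """Keep a few globally attended key columns (vertical stripes)."""
    keep_cols = (torch.arange(n) % stripe_every) < width
    return keep_cols[None, :].expand(n, n)

def choose_pattern(attn_probs: torch.Tensor, masks: dict, keep_thresh: float = 0.9) -> str:
    """Toy selection rule: pick the cheapest mask that retains at least
    `keep_thresh` of this head's attention mass; otherwise stay dense."""
    best, best_cost = "dense", attn_probs.numel()  # dense cost proxy: all entries
    for name, mask in masks.items():
        mass = attn_probs[mask].sum() / attn_probs.sum()
        cost = int(mask.sum())                     # cost proxy: retained entries
        if mass >= keep_thresh and cost < best_cost:
            best, best_cost = name, cost
    return best

# Example: classify one head's attention map over a short token sequence.
n = 512
q, k = torch.randn(n, 64), torch.randn(n, 64)
probs = torch.softmax(q @ k.T / 8.0, dim=-1)
masks = {
    "diagonal": diagonal_mask(n),
    "multi_diagonal": multi_diagonal_mask(n),
    "vertical_stripe": vertical_stripe_mask(n),
}
print(choose_pattern(probs, masks))
```

According to the abstract, this per-layer, per-head selection is done offline with a hardware-aware cost model (rather than the simple entry-count proxy above), so no search cost is paid at inference time.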
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VORTA: Efficient Video Diffusion via Routing Sparse Attention (2025)
- DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance (2025)
- Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation (2025)
- RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy (2025)
- VSA: Faster Video Diffusion with Trainable Sparse Attention (2025)
- Analysis of Attention in Video Diffusion Transformers (2025)
- Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing (2025)