HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Abstract
HoloCine generates coherent multi-shot narratives using a Window Cross-Attention mechanism and Sparse Inter-Shot Self-Attention, enabling end-to-end cinematic creation.
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine exhibits remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
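The abstract only names the two attention mechanisms, so below is a minimal PyTorch sketch of one plausible instantiation of their masks, not the authors' implementation. The shot lengths, per-shot prompt lengths, the number of cross-shot "anchor" tokens, and the optional shared scene prompt are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the HoloCine implementation) of the two
# attention masks described in the abstract: sparse inter-shot self-attention
# and window cross-attention. Shapes and parameters are assumed for clarity.
import torch


def sparse_inter_shot_self_attn_mask(shot_lengths, anchors_per_shot=4):
    """Boolean self-attention mask over all video tokens of a scene.

    Dense within each shot; across shots, queries may only attend to the first
    `anchors_per_shot` tokens of every other shot (one possible sparse pattern).
    True = attention allowed.
    """
    total = sum(shot_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    starts = [0]
    for length in shot_lengths:
        starts.append(starts[-1] + length)
    for i, li in enumerate(shot_lengths):
        q = slice(starts[i], starts[i] + li)
        mask[q, q] = True  # dense attention within the shot
        for j, lj in enumerate(shot_lengths):
            if i == j:
                continue
            k = slice(starts[j], starts[j] + min(anchors_per_shot, lj))
            mask[q, k] = True  # sparse links to a few anchor tokens per other shot
    return mask


def window_cross_attn_mask(shot_lengths, prompt_lengths, global_prompt_len=0):
    """Boolean cross-attention mask from video tokens to text tokens.

    Each shot's video tokens see only that shot's prompt tokens, plus an
    optional global scene prompt shared by all shots. True = attention allowed.
    """
    assert len(shot_lengths) == len(prompt_lengths)
    n_video = sum(shot_lengths)
    n_text = global_prompt_len + sum(prompt_lengths)
    mask = torch.zeros(n_video, n_text, dtype=torch.bool)
    mask[:, :global_prompt_len] = True  # every shot may read the shared scene prompt
    v, t = 0, global_prompt_len
    for lv, lt in zip(shot_lengths, prompt_lengths):
        mask[v:v + lv, t:t + lt] = True  # this shot's tokens see only its own prompt
        v += lv
        t += lt
    return mask


# Example: a three-shot scene with 6, 4, and 5 video tokens and 7 text tokens per shot.
self_mask = sparse_inter_shot_self_attn_mask([6, 4, 5], anchors_per_shot=2)
cross_mask = window_cross_attn_mask([6, 4, 5], [7, 7, 7], global_prompt_len=10)
print(self_mask.shape, cross_mask.shape)  # torch.Size([15, 15]) torch.Size([15, 31])
```

Such boolean masks could, for instance, be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where `True` marks positions that are allowed to attend.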
Community
HoloCine is a text-to-video framework that holistically generates coherent, cinematic multi-shot video narratives from a single prompt, combining Window Cross-Attention for per-shot control and Sparse Inter-Shot Self-Attention for efficient, consistent long-scene generation.
Thanks a lot @taesiri for helping us submit our paper to the daily papers!
Could we please use the following video as the cover to better showcase our results?
https://holo-cine.github.io/holocine.mp4
Congrats on the amazing work!
Unfortunately, it seems the media tag cannot be updated after an initial submission has been made. I think I messed that up, super sorry about that.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation (2025)
- Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis (2025)
- CharCom: Composable Identity Control for Multi-Character Story Illustration (2025)
- AudioStory: Generating Long-Form Narrative Audio with Large Language Models (2025)
- LongLive: Real-time Interactive Long Video Generation (2025)
- Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset (2025)
- TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation (2025)