MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Abstract
MultiShotMaster extends a single-shot model with novel RoPE variants for flexible and controllable multi-shot video generation, addressing data scarcity with an automated annotation pipeline.
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, narrative coherence, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages these intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, subject customization with motion control, and background-driven scene customization. Both the shot count and shot durations are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
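As a concrete illustration of the phase-shift idea behind Multi-Shot Narrative RoPE, here is a minimal sketch (not the authors' implementation). It assumes frames within a shot receive consecutive temporal RoPE positions, and each shot transition adds a constant extra offset so that shots remain separated in phase while the global narrative order is preserved. The function names and the `shot_phase_shift` hyperparameter are hypothetical placeholders, not from the paper.

```python
# Minimal sketch of an explicit phase shift at shot transitions for temporal RoPE.
# Assumption: frames inside a shot get consecutive positions; each shot boundary
# adds a constant offset (`shot_phase_shift`, a hypothetical hyperparameter).
import torch

def multi_shot_temporal_positions(shot_lengths, shot_phase_shift=16.0):
    """One temporal RoPE position per frame, for a list of shot lengths."""
    positions, offset = [], 0.0
    for num_frames in shot_lengths:
        positions.append(torch.arange(num_frames, dtype=torch.float32) + offset)
        # advance past this shot, then add the extra inter-shot phase shift
        offset += num_frames + shot_phase_shift
    return torch.cat(positions)

def rope_angles(positions, head_dim):
    """Standard 1-D RoPE rotation angles for the given (shifted) positions."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)  # (num_frames, head_dim // 2)

# Example: a 3-shot video with 8, 12, and 8 frames.
pos = multi_shot_temporal_positions([8, 12, 8])
angles = rope_angles(pos, head_dim=64)  # angles.shape == (28, 32)
```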
Community
The first controllable multi-shot video generation framework, supporting text-driven inter-shot consistency, subject customization with motion control, and background-driven scene customization. Both shot counts and shot durations are variable.
Seems like it's only one step away from being Sora 2. Honestly, if I had the compute, I'd love to work on this. Add just a bit more multimodal conditioning, e.g. by using VACE and maybe MultiTalk components as your base model, and you'd have a model that can generate from scratch with audio capability, plus the ability to restyle/edit videos. There are also plenty of optimizations for auto-regressive generation that could be used for better speedups. So much good work that could be done.
Thank you for your interest and insightful comments! We share the same vision. Here are some promising research directions in the multi-shot setting:
- Extending controllable functionalities (such as multimodal conditioning and style transfer) from single-shot to multi-shot settings.
- Integrating audio capabilities for multi-shot conversations, as in MoCha (https://congwei1230.github.io/MoCha/).
- Auto-regressive next-shot generation, as in Cut2Next (https://vchitect.github.io/Cut2Next-project/), LCT (https://guoyww.github.io/projects/long-context-video/), and Mask2DiT (https://tianhao-qi.github.io/Mask2DiTProject/).
The fundamental challenges include efficient implementations, data curation, and computing power.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives (2025)
- MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation (2025)
- VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning (2025)
- InstanceV: Instance-Level Video Generation (2025)
- TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction (2025)
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation (2025)
- Video-As-Prompt: Unified Semantic Control for Video Generation (2025)