MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Abstract
MultiShotMaster extends a single-shot model with novel RoPE variants for flexible and controllable multi-shot video generation, addressing data scarcity with an automated annotation pipeline.
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, narrative coherence, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages these intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, subject customization with motion control, and background-driven scene customization. Both the shot count and shot durations are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
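As a concrete illustration of the phase-shift idea behind Multi-Shot Narrative RoPE, here is a minimal sketch (not the authors' implementation). It assumes frames within a shot receive consecutive temporal RoPE positions, and each shot transition adds a constant extra offset so that shots remain separated in phase while the global narrative order is preserved. The function names and the `shot_phase_shift` hyperparameter are hypothetical placeholders, not from the paper.

```python
# Minimal sketch of an explicit phase shift at shot transitions for temporal RoPE.
# Assumption: frames inside a shot get consecutive positions; each shot boundary
# adds a constant offset (`shot_phase_shift`, a hypothetical hyperparameter).
import torch

def multi_shot_temporal_positions(shot_lengths, shot_phase_shift=16.0):
    """One temporal RoPE position per frame, for a list of shot lengths."""
    positions, offset = [], 0.0
    for num_frames in shot_lengths:
        positions.append(torch.arange(num_frames, dtype=torch.float32) + offset)
        # advance past this shot, then add the extra inter-shot phase shift
        offset += num_frames + shot_phase_shift
    return torch.cat(positions)

def rope_angles(positions, head_dim):
    """Standard 1-D RoPE rotation angles for the given (shifted) positions."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)  # (num_frames, head_dim // 2)

# Example: a 3-shot video with 8, 12, and 8 frames.
pos = multi_shot_temporal_positions([8, 12, 8])
angles = rope_angles(pos, head_dim=64)  # angles.shape == (28, 32)
```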
Community
The first controllable multi-shot video generation framework, supporting text-driven inter-shot consistency, subject customization with motion control, and background-driven scene customization. Both shot counts and shot durations are variable.
Seems like it's only one step away from being Sora 2. Honestly, if I had the compute, I'd love to work on this. Add just a bit more multimodal conditioning, e.g. by using VACE and maybe MultiTalk components as your base model, and you'd have a model that can generate from scratch with audio capability, plus the ability to restyle/edit videos. There are also plenty of optimizations for auto-regressive generation that could be used for better speedups. So much good work that could be done.
Thank you for your interest and insightful comments! We share the same vision. Here are some promising research directions in the multi-shot setting:
- Extending controllable functionalities (such as multimodal conditioning and style transfer) from single-shot to multi-shot settings.
- Integrating audio capabilities for multi-shot conversations, as in MoCha (https://congwei1230.github.io/MoCha/).
- Auto-regressive next-shot generation, as in Cut2Next (https://vchitect.github.io/Cut2Next-project/), LCT (https://guoyww.github.io/projects/long-context-video/), and Mask2DiT (https://tianhao-qi.github.io/Mask2DiTProject/).
The fundamental challenges include efficient implementations, data curation, and computing power.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives (2025)
- MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation (2025)
- VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning (2025)
- InstanceV: Instance-Level Video Generation (2025)
- TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction (2025)
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation (2025)
- Video-As-Prompt: Unified Semantic Control for Video Generation (2025)