arxiv:2512.03041

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Published on Dec 2
· Submitted by QINGHE WANG on Dec 3
#3 Paper of the day
Authors:
Abstract

MultiShotMaster extends a single-shot model with novel RoPE variants for flexible and controllable multi-shot video generation, addressing data scarcity with an automated annotation pipeline.

AI-generated summary

Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subjects with motion control, and background-driven customized scenes. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
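The abstract describes the Multi-Shot Narrative RoPE only at a high level, so the following is a minimal, hypothetical sketch of how an explicit phase shift at shot transitions could sit on top of a standard 1D temporal RoPE. The helper names, the `shot_gap` offset, and the frame counts are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a "phase-shifted" temporal RoPE for multi-shot video.
# The shot_gap offset, function names, and shapes are assumptions for
# illustration; they are not taken from the paper's code.
import torch


def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles: outer product of positions and inverse frequencies."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)  # (num_tokens, dim // 2)


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def multi_shot_positions(frames_per_shot: list[int], shot_gap: int = 32) -> torch.Tensor:
    """Temporal positions that advance normally within a shot and jump by an
    extra offset (shot_gap) at each shot boundary, so shots remain in
    narrative order while being explicitly separated in position space."""
    positions, t = [], 0
    for n in frames_per_shot:
        positions.extend(range(t, t + n))
        t += n + shot_gap  # explicit phase shift at the shot transition
    return torch.tensor(positions)


# Example: three shots of 16, 24, and 16 frames; queries with head_dim 64.
pos = multi_shot_positions([16, 24, 16])
q = torch.randn(len(pos), 64)
q_rot = apply_rope(q, rope_angles(pos, dim=64))
```

In a similar spirit, the Spatiotemporal Position-Aware RoPE would assign positions to reference tokens so that they align with the grounded spatiotemporal regions they are meant to influence; the paper should be consulted for the exact formulation.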

Community

Paper author · Paper submitter

The first controllable multi-shot video generation framework, supporting text-driven inter-shot consistency, customized subjects with motion control, and background-driven customized scenes. Both shot count and shot duration are variable.

It seems like it's only one step away from being Sora 2. Honestly, if I had the compute, I'd love to work on this. Add a bit more multimodal conditioning, e.g., by using VACE and perhaps MultiTalk components as the base model, and you'd have a model that can generate from scratch with audio capability, plus the ability to restyle and edit videos. There are also plenty of optimizations for auto-regressive generation that could be used for better speedups. So much good work could be done.

·

Thank you for your interest and insightful comments! We share the same vision. Some new research directions in the multi-shot setting:

  1. Extending controllable functionalities (such as multimodal conditioning and style transfer) from the single-shot to the multi-shot setting.
  2. Integrating audio capabilities for multi-shot conversation, as in MoCha (https://congwei1230.github.io/MoCha/).
  3. Auto-regressive next-shot generation, as in Cut2Next (https://vchitect.github.io/Cut2Next-project/), LCT (https://guoyww.github.io/projects/long-context-video/), and Mask2DiT (https://tianhao-qi.github.io/Mask2DiTProject/).

The fundamental challenges include efficient implementation, data curation, and computing power.

