AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
Abstract
AnimeShooter is a reference-guided multi-shot animation dataset with comprehensive hierarchical annotations and strong cross-shot visual consistency for coherent animated video generation, and its baseline model AnimeShooterGen combines MLLMs with video diffusion models to achieve superior consistency and adherence to reference guidance.
Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips from narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by the MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, highlighting the value of our dataset for coherent animated video generation.
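To make the annotation hierarchy concrete, the sketch below mirrors the story-level and shot-level fields described above as Python dataclasses. The class and field names are illustrative assumptions rather than the dataset's actual on-disk schema; consult the Hugging Face dataset card for the released format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CharacterProfile:
    """Main character profile from the story-level annotations."""
    name: str
    description: str
    reference_image: str                 # path to the character reference image

@dataclass
class ShotAnnotation:
    """One shot from the shot-level annotations."""
    shot_id: int
    scene: str                           # scene in which the shot takes place
    characters: List[str]                # characters appearing in the shot
    narrative_caption: str               # story-driven caption
    descriptive_caption: str             # visual description of the shot content
    video_path: str
    # The fields below apply only to the AnimeShooter-audio subset.
    audio_path: Optional[str] = None
    audio_description: Optional[str] = None
    sound_sources: Optional[List[str]] = None

@dataclass
class StoryAnnotation:
    """Story-level annotations covering one multi-shot animation clip."""
    story_id: str
    storyline: str                       # overview of the narrative
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[ShotAnnotation] = field(default_factory=list)
```

Likewise, the reference- and context-conditioned generation described for AnimeShooterGen can be read as an autoregressive loop over shots. The snippet below is a minimal sketch assuming hypothetical `mllm.encode` and `video_diffusion.sample` wrappers and a truncated context window; it is not the released implementation, which is available in the GitHub repository.

```python
import torch

def generate_multi_shot_video(mllm, video_diffusion, reference_image,
                              shot_captions, max_context_shots=3):
    """Autoregressively decode shots, conditioning each new shot on the
    character reference image and previously generated shots.

    `mllm.encode` is assumed to return conditioning embeddings and
    `video_diffusion.sample` to decode a video clip from them; both are
    hypothetical wrappers, and truncating the context to the most recent
    `max_context_shots` shots is an assumption, not a detail from the paper.
    """
    generated_shots = []
    for caption in shot_captions:
        # Use only the most recent shots as visual context to bound memory.
        context = generated_shots[-max_context_shots:]

        # The MLLM fuses the reference image, prior shots, and the current
        # shot caption into reference- and context-aware representations.
        condition = mllm.encode(reference_image=reference_image,
                                context_shots=context,
                                caption=caption)

        # The video diffusion model decodes the next shot from that condition.
        with torch.no_grad():
            shot = video_diffusion.sample(condition)
        generated_shots.append(shot)

    return generated_shots
```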
Community
Introducing AnimeShooter 🎬: A reference-guided multi-shot animation dataset for coherent video generation! Our dataset features:
- Hierarchical annotations (story/shot-level) with character reference images
- Strong consistency across shots via automated pipeline
- Audio subset with synchronized tracks (AnimeShooter-audio)
- Baseline model (AnimeShooterGen) combining MLLMs and diffusion models
Resources:
- 📽️ Project Page: qiulu66.github.io/animeshooter
- 📄 Paper: arxiv.org/abs/2506.03126
- 💻 Code: github.com/qiulu66/Anime-Shooter
- 🗂️ Dataset: huggingface.co/datasets/qiulu66/AnimeShooter
- 🤗 Model: huggingface.co/qiulu66/AnimeShooterGen
The following similar papers were recommended by the Semantic Scholar API (via the automated Librarian Bot):
- STORYANCHORS: Generating Consistent Multi-Scene Story Frames for Long-Form Narratives (2025)
- HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions (2025)
- Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts (2025)
- CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition (2025)
- FocusedAD: Character-centric Movie Audio Description (2025)
- ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models (2025)
- HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters (2025)