---
license: apache-2.0
---

SkyReels-A2: Compose Anything in Video Diffusion Transformers


🌐 GitHub · 👋 Playground · Discord · 🔥 A2-Bench Leaderboard

This repo contains Diffusers-style model weights for SkyReels-A2 models. You can find the inference code in the SkyReels-A2 repository.

Models

| Model | Download Link | Video Size (frames × height × width) |
|---|---|---|
| A2-Wan2.1-14B-Preview | Huggingface 🤗 | ~ 81 × 480 × 832 |
| A2-Wan2.1-14B | To be released | ~ 81 × 480 × 832 |
| A2-Wan2.1-14B-Infinity | To be released | ~ Inf × 720 × 1080 |
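For a quick start, below is a minimal loading sketch, assuming the Preview checkpoint follows the standard diffusers WanImageToVideoPipeline layout. The repo id, reference image, and prompt are placeholders, and the official inference code in the SkyReels-A2 repository remains the reference implementation.

```python
# Minimal sketch: loading these Diffusers-style weights with
# WanImageToVideoPipeline. Repo id and inputs are placeholders,
# not verified against the released checkpoint.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

repo_id = "Skywork/SkyReels-A2"  # assumption: replace with the actual repo id

pipe = WanImageToVideoPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image("reference.png")  # placeholder reference image
video = pipe(
    image=image,
    prompt="a person walking on the beach",  # placeholder prompt
    height=480,
    width=832,
    num_frames=81,  # matches the ~81 x 480 x 832 size listed above
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```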


Overview of the SkyReels-A2 framework. Our approach begins by encoding all reference images using two distinct branches. The first, the spatial feature branch (represented in red, top row), leverages a fine-grained VAE encoder to process per-composition images. The second, the semantic feature branch (bottom row), uses a CLIP vision encoder followed by an MLP projection to encode semantic references. The spatial features are then concatenated with the noised video tokens along the channel dimension before being passed through the diffusion transformer blocks. Meanwhile, the semantic features extracted from the reference images are incorporated into the diffusion transformer via supplementary cross-attention layers, ensuring that the semantic context is effectively integrated during diffusion.
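To make the two-branch design concrete, here is a schematic PyTorch sketch of the conditioning path described above: spatial VAE features are concatenated with the noised video latents along the channel dimension, while CLIP-derived semantic features enter each block through a supplementary cross-attention layer after an MLP projection. All shapes, widths, and module names are illustrative assumptions, not the released implementation.

```python
# Schematic sketch of the two-branch conditioning described above.
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

# Spatial branch: VAE features of the composition images are concatenated
# with the noised video latents along the channel dimension (dim=1).
B, C, F, H, W = 1, 16, 21, 60, 104           # illustrative Wan-like latent shape
noised_latents = torch.randn(B, C, F, H, W)
spatial_feats = torch.randn(B, C, F, H, W)   # fine-grained VAE encoder output
dit_input = torch.cat([noised_latents, spatial_feats], dim=1)  # (B, 2C, F, H, W)

class ComposedDiTBlock(nn.Module):
    """One transformer block with a supplementary cross-attention layer
    that injects semantic (CLIP) reference features."""

    def __init__(self, dim=1024, heads=16, clip_dim=1280):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.sem_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # MLP projection from the CLIP vision encoder width to the model width.
        self.sem_proj = nn.Sequential(
            nn.Linear(clip_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens, clip_feats):
        # video_tokens: (B, N, dim), patchified from the channel-concatenated
        # latents above. clip_feats: (B, M, clip_dim) semantic reference tokens.
        x = video_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        sem = self.sem_proj(clip_feats)       # project to the model width
        x = x + self.sem_cross_attn(self.norm2(x), sem, sem)[0]
        return x + self.ff(self.norm3(x))

block = ComposedDiTBlock()
tokens = torch.randn(1, 256, 1024)     # stand-in for patchified video tokens
clip_feats = torch.randn(1, 257, 1280)
out = block(tokens, clip_feats)        # (1, 256, 1024)
```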


Some generated results are shown on the project pages linked above.

Citation

If you find SkyReels-A2 useful for your research, please cite our work using the following BibTeX:

@article{fei2025skyreels,
  title={SkyReels-A2: Compose Anything in Video Diffusion Transformers},
  author={Fei, Zhengcong and Li, Debang and Qiu, Di and Wang, Jiahua and Dou, Yikun and Wang, Rui and Xu, Jingtao and Fan, Mingyuan and Chen, Guibin and Li, Yang and others},
  journal={arXiv preprint arXiv:2504.02436},
  year={2025}
}