---
license: apache-2.0
---

SkyReels-A2: Compose Anything in Video Diffusion Transformers


🌐 GitHub · 👋 Playground · Discord · 🔥 A2-Bench Leaderboard

This repo contains Diffusers-style model weights for SkyReels-A2 models. You can find the inference code in the SkyReels-A2 repository.

Models

| Model | Download Link | Video Size (frames × height × width) |
|---|---|---|
| A2-Wan2.1-14B-Preview | Huggingface 🤗 | ~ 81 × 480 × 832 |
| A2-Wan2.1-14B | To be released | ~ 81 × 480 × 832 |
| A2-Wan2.1-14B-Infinity | To be released | ~ Inf × 720 × 1080 |
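For a quick start, below is a minimal loading sketch, assuming the Preview checkpoint follows the standard diffusers WanImageToVideoPipeline layout. The repo id, reference image, and prompt are placeholders, and the official inference code in the SkyReels-A2 repository remains the reference implementation.

```python
# Minimal sketch: loading these Diffusers-style weights with
# WanImageToVideoPipeline. Repo id and inputs are placeholders,
# not verified against the released checkpoint.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

repo_id = "Skywork/SkyReels-A2"  # assumption: replace with the actual repo id

pipe = WanImageToVideoPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image("reference.png")  # placeholder reference image
video = pipe(
    image=image,
    prompt="a person walking on the beach",  # placeholder prompt
    height=480,
    width=832,
    num_frames=81,  # matches the ~81 x 480 x 832 size listed above
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```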


Overview of the SkyReels-A2 framework. Our approach begins by encoding all reference images using two distinct branches. The first, the spatial feature branch (represented in red, top row), leverages a fine-grained VAE encoder to process per-composition images. The second, the semantic feature branch (bottom row), uses a CLIP vision encoder followed by an MLP projection to encode semantic references. The spatial features are then concatenated with the noised video tokens along the channel dimension before being passed through the diffusion transformer blocks. Meanwhile, the semantic features extracted from the reference images are incorporated into the diffusion transformer via supplementary cross-attention layers, ensuring that the semantic context is effectively integrated during diffusion.
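To make the two-branch design concrete, here is a schematic PyTorch sketch of the conditioning path described above: spatial VAE features are concatenated with the noised video latents along the channel dimension, while CLIP-derived semantic features enter each block through a supplementary cross-attention layer after an MLP projection. All shapes, widths, and module names are illustrative assumptions, not the released implementation.

```python
# Schematic sketch of the two-branch conditioning described above.
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

# Spatial branch: VAE features of the composition images are concatenated
# with the noised video latents along the channel dimension (dim=1).
B, C, F, H, W = 1, 16, 21, 60, 104           # illustrative Wan-like latent shape
noised_latents = torch.randn(B, C, F, H, W)
spatial_feats = torch.randn(B, C, F, H, W)   # fine-grained VAE encoder output
dit_input = torch.cat([noised_latents, spatial_feats], dim=1)  # (B, 2C, F, H, W)

class ComposedDiTBlock(nn.Module):
    """One transformer block with a supplementary cross-attention layer
    that injects semantic (CLIP) reference features."""

    def __init__(self, dim=1024, heads=16, clip_dim=1280):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.sem_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # MLP projection from the CLIP vision encoder width to the model width.
        self.sem_proj = nn.Sequential(
            nn.Linear(clip_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens, clip_feats):
        # video_tokens: (B, N, dim), patchified from the channel-concatenated
        # latents above. clip_feats: (B, M, clip_dim) semantic reference tokens.
        x = video_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        sem = self.sem_proj(clip_feats)       # project to the model width
        x = x + self.sem_cross_attn(self.norm2(x), sem, sem)[0]
        return x + self.ff(self.norm3(x))

block = ComposedDiTBlock()
tokens = torch.randn(1, 256, 1024)     # stand-in for patchified video tokens
clip_feats = torch.randn(1, 257, 1280)
out = block(tokens, clip_feats)        # (1, 256, 1024)
```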


Some generated results are shown on the project pages linked above.

Citation

If you find SkyReels-A2 useful for your research, please cite our work using the following BibTeX:

@article{fei2025skyreels,
  title={SkyReels-A2: Compose Anything in Video Diffusion Transformers},
  author={Fei, Zhengcong and Li, Debang and Qiu, Di and Wang, Jiahua and Dou, Yikun and Wang, Rui and Xu, Jingtao and Fan, Mingyuan and Chen, Guibin and Li, Yang and others},
  journal={arXiv preprint arXiv:2504.02436},
  year={2025}
}