---
license: apache-2.0
---

# FastVideo Wan2.1-VSA-T2V-14B-720P-Diffusers

## Model Overview

- This model is finetuned with VSA from Wan-AI/Wan2.1-T2V-14B-Diffusers.
- It achieves up to a 2.1x speedup on a single H100 GPU.
- The model is trained at 77×768×1280 resolution; it can also generate videos at other resolutions, though quality may degrade.
- We set the VSA attention sparsity to 0.9 during training, which ran for 1,500 steps (~14 hours). At inference, you can tune the sparsity between 0 and 0.9 to trade off speed and quality.
- Both finetuning and inference scripts are available in the FastVideo repository.
- Try it out with FastVideo; we support a wide range of GPUs, from H100 to 4090. A minimal loading sketch follows this list.
- We use the FastVideo 720P Synthetic Wan dataset for training.
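
Since this checkpoint ships in the Diffusers format, here is a minimal sketch of loading it with Diffusers' `WanPipeline`. The repo id, prompt, and generation settings below are illustrative assumptions (the 77×768×1280 training resolution is read as frames × height × width); note that plain Diffusers inference does not expose the VSA sparsity knob, which is configured through the FastVideo inference scripts.

```python
# Minimal sketch: load this checkpoint with Diffusers' WanPipeline.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "FastVideo/Wan2.1-VSA-T2V-14B-720P-Diffusers",  # assumed repo id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Match the training configuration, assuming 77x768x1280 means
# 77 frames at 768 (height) x 1280 (width).
video = pipe(
    prompt="A cinematic shot of a boat sailing through a stormy sea",
    height=768,
    width=1280,
    num_frames=77,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```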

If you use the Wan2.1-VSA-T2V-14B-720P-Diffusers model in your research, please cite our papers:

```bibtex
@article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
  author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.13389},
  year={2025}
}
@article{zhang2025fast,
  title={Fast Video Generation with Sliding Tile Attention},
  author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
  journal={arXiv preprint arXiv:2502.04507},
  year={2025}
}
```