---
license: apache-2.0
---

# FastVideo Wan2.1-VSA-T2V-14B-720P-Diffusers 
<p align="center">
  <img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.png" width="200"/>
</p>
<div>
  <div align="center">
    <a href="https://github.com/hao-ai-lab/FastVideo" target="_blank">FastVideo Team</a>&emsp;
  </div>
  <div align="center">
    <a href="https://arxiv.org/pdf/2505.13389">Paper</a> | 
    <a href="https://github.com/hao-ai-lab/FastVideo">Github</a>
  </div>
</div>


## Model Overview
- This model is finetuned from [Wan-AI/Wan2.1-T2V-14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers) with [VSA](https://arxiv.org/pdf/2505.13389).
- It achieves up to a 2.1x speedup on a single **H100** GPU.
- The model is trained at **77×768×1280** resolution, but it supports generating videos at **any resolution** (quality may degrade).
- We set the **VSA attention sparsity** to 0.9 and train for **1500 steps (~14 hours)**. At inference time, you can tune the sparsity between 0 and 0.9 to trade off speed against quality.
- Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:  
  - [1 Node/GPU debugging finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/finetune/finetune_v1_VSA.sh)
  - [Slurm training example script](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/training/finetune/Wan2.1-VSA/Wan-Syn-Data/T2V-14B-VSA.slurm)  
  - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_VSA.sh)
For example, to install FastVideo with the VSA kernels and run inference from the command line:

```bash
# install FastVideo and the VSA kernels first
git clone https://github.com/hao-ai-lab/FastVideo
cd FastVideo
pip install -e .
cd csrc/attn
git submodule update --init --recursive
python setup_vsa.py install

num_gpus=1
export FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
# change the model path to a local directory to run inference with your own checkpoint
export MODEL_BASE=FastVideo/Wan2.1-VSA-T2V-14B-720P-Diffusers
fastvideo generate \
    --model-path $MODEL_BASE \
    --sp-size $num_gpus \
    --tp-size 1 \
    --num-gpus $num_gpus \
    --dit-cpu-offload False \
    --vae-cpu-offload False \
    --text-encoder-cpu-offload True \
    --pin-cpu-memory False \
    --height 720 \
    --width 1280 \
    --num-frames 81 \
    --num-inference-steps 50 \
    --fps 16 \
    --guidance-scale 5.0 \
    --flow-shift 5.0 \
    --VSA-sparsity 0.9 \
    --prompt-txt assets/prompt.txt \
    --negative-prompt "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards" \
    --seed 1024 \
    --output-path outputs_Wan-VSA-14B/ \
    --enable_torch_compile
```
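Because the checkpoint is stored in the Diffusers format, it can also be loaded with the standard `WanPipeline` from `diffusers`. The snippet below is a minimal sketch assuming a recent `diffusers` release that includes `WanPipeline` and `AutoencoderKLWan`; note that plain Diffusers runs dense attention, so the VSA speedup only applies when running through FastVideo with the `VIDEO_SPARSE_ATTN` backend.

```python
# Minimal sketch (assumption: a diffusers release with Wan2.1 support installed).
# This path uses dense attention; the VSA speedup requires the FastVideo backend above.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "FastVideo/Wan2.1-VSA-T2V-14B-720P-Diffusers"

# The Wan VAE is typically kept in float32 for numerical stability.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A curious raccoon explores a neon-lit city street at night",
    negative_prompt="blurred details, overexposed, static, worst quality, low quality",
    height=720,
    width=1280,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=16)
```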
- Try it out with **FastVideo**; we support a wide range of GPUs, from **H100** to **4090**.
- We use the [FastVideo 720P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x768x1280_250k) for training.



If you use the Wan2.1-VSA-T2V-14B-720P-Diffusers model in your research, please cite our papers:
```
@article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
  author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.13389},
  year={2025}
}
@article{zhang2025fast,
  title={Fast video generation with sliding tile attention},
  author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
  journal={arXiv preprint arXiv:2502.04507},
  year={2025}
}
```