---
license: apache-2.0
datasets:
- RaphaelLiu/PusaV1_training
base_model:
- Wan-AI/Wan2.2-T2V-A14B
tags:
- image-to-video
- start-end-frames
- text-to-video
- video-to-video
- video-extension
---

# Pusa Wan2.2 V1.0 Model

[Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Project Page](https://yaofang-liu.github.io/Pusa_Web/) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV1_training) | [Wan2.1 Model](https://huggingface.co/RaphaelLiu/PusaV1) | [Paper (Pusa V1.0)](https://arxiv.org/abs/2507.16116) | [Paper (FVDM)](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)

**Pusa Wan2.2 V1.0** extends the groundbreaking Pusa paradigm to the advanced **Wan2.2-T2V-A14B** architecture, which features a **MoE DiT design** with separate high-noise and low-noise models. This architecture provides enhanced quality control and generation capabilities while maintaining the **vectorized timestep adaptation (VTA)** approach.

**Various tasks in one model, all supporting 4-step inference with LightX2V**: Image-to-Video, Start-End Frames, Video Completion, Video Extension, Text-to-Video, Video Transition, and more...
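Conceptually, VTA gives every latent frame its own diffusion timestep, so conditioning frames (for example the input image of an image-to-video job) can be held at a low noise level while the remaining frames are denoised from pure noise; Wan2.2's high-noise and low-noise experts then handle the early and late parts of the schedule. The snippet below is only an illustrative sketch of that idea under assumed names and mechanics (`high_expert`, `low_expert`, the `boundary` switch point, and the flow-style Euler update are assumptions, not the actual Pusa/Wan2.2 code):

```python
import torch

def denoise_with_vta(high_expert, low_expert, latents, cond_latents,
                     noise_mult, sigmas, boundary=0.9):
    """Illustrative sketch of vectorized timestep adaptation (VTA).

    latents:      [F, C, H, W] video latents, initialized from pure noise
    cond_latents: {frame_idx: clean latent} for conditioning frames
    noise_mult:   {frame_idx: multiplier in [0, 1]}; 0 keeps a frame clean
    sigmas:       decreasing noise schedule, e.g. 4 values for 4-step inference
    boundary:     assumed switch point between the high- and low-noise experts
    """
    num_frames = latents.shape[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Vectorized timestep: every frame carries its own noise level.
        frame_sigma = torch.full((num_frames,), sigma)
        for idx, mult in noise_mult.items():
            frame_sigma[idx] = sigma * mult
            # Conditioning frames stay pinned near their clean content.
            latents[idx] = (1 - frame_sigma[idx]) * cond_latents[idx] \
                           + frame_sigma[idx] * torch.randn_like(latents[idx])
        # MoE routing: high-noise expert early in the schedule, low-noise expert late.
        expert = high_expert if sigma >= boundary else low_expert
        velocity = expert(latents, frame_sigma)              # flow-style prediction
        latents = latents + (sigma_next - sigma) * velocity  # Euler update
    return latents
```

With an empty `noise_mult` this reduces to plain text-to-video; conditioning frame 0 gives image-to-video, and conditioning the first and last frames gives start-end-frames, which is what the `--cond_position` and `--noise_multipliers` flags control in the commands below.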
**Example 1: Image-to-Video in 4 Steps**

- noise: 0.2, high_lora_alpha: 1.5
- noise: 0.3, high_lora_alpha: 1.4
- noise: 0.2, high_lora_alpha: 1.5
- noise: 0.2, high_lora_alpha: 1.5
**Example 2: Video Extension in 4 Steps**
- noise: [0.0, 0.3, 0.5, 0.7], high_lora_alpha: 1.5
- noise: [0.2, 0.4, 0.4, 0.4], high_lora_alpha: 1.4
**Example 3: Start-End Frames in 4 Steps**
- noise: [0.2, 0.5], high_lora_alpha: 1.5
- noise: [0.0, 0.4], high_lora_alpha: 1.5
**Example 4: Text-to-Video in 4 Steps**
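The captions above list, for each demo clip, the noise multiplier applied to every conditioned latent frame and the `high_lora_alpha` used (Example 4 is pure text-to-video, so no frames are conditioned). In the commands in the next section, these settings become the `--cond_position` and `--noise_multipliers` flags, with one multiplier per conditioned frame. A minimal sketch of that mapping follows; the latent frame count of 21 is an assumption for illustration only, not a value taken from this card:

```python
def per_frame_noise(cond_position: str, noise_multipliers: str,
                    num_latent_frames: int = 21):
    """Expand the CLI flags into one noise multiplier per latent frame.
    Unconditioned frames keep multiplier 1.0, i.e. they are generated from
    pure noise; conditioned frames keep only a fraction of the noise."""
    positions = [int(p) for p in cond_position.split(",")]
    multipliers = [float(m) for m in noise_multipliers.split(",")]
    per_frame = [1.0] * num_latent_frames
    for pos, mult in zip(positions, multipliers):
        per_frame[pos] = mult
    return per_frame

# Start-end-frames demo: condition the first and last latent frames.
print(per_frame_noise("0,20", "0.2,0.5"))
# [0.2, 1.0, 1.0, ..., 1.0, 0.5]
```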
## Installation and Usage

### Download Weights and Setup

**Option 1**: Use the Hugging Face CLI:

```shell
# Make sure you are in the PusaV1 directory
# Install huggingface-cli if you don't have it
pip install -U "huggingface_hub[cli]"
huggingface-cli download RaphaelLiu/Pusa-Wan2.2-V1 --local-dir ./model_zoo/PusaV1/Wan2.2-Models

# Download base Wan2.2 models if you don't have them
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./model_zoo/PusaV1/Wan2.2-T2V-A14B
```

**Option 2**: Download the LoRA checkpoints directly from [this Hugging Face repository](https://huggingface.co/RaphaelLiu/Pusa-Wan2.2-V1).

### Usage Examples

Use these checkpoints with the [Pusa codebase](https://github.com/Yaofang-Liu/Pusa-VidGen).

### Wan2.2 w/ ⚡ LightX2V Acceleration

LightX2V provides ultra-fast 4-step inference while maintaining generation quality. It is compatible with both the Wan2.1 and Wan2.2 models.

**Key Parameters for LightX2V:**
- `--lightx2v`: Enable LightX2V acceleration
- `--cfg_scale 1`: **Critical**, must be set to 1 for LightX2V
- `--num_inference_steps 4`: Use 4 steps instead of 30
- `--high_lora_alpha 1.5`, `--low_lora_alpha 1.4`: Recommended values for LightX2V (a larger alpha gives smaller motion); `high_lora_alpha` has the bigger impact on the output

**Example 1: Wan2.2 Image-to-Video with LightX2V**

```shell
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
  --image_paths "./demos/input_image.jpg" \
  --prompt "A wide-angle shot shows a serene monk meditating perched a top of the letter E of a pile of weathered rocks that vertically spell out 'ZEN'. The rock formation is perched atop a misty mountain peak at sunrise. The warm light bathes the monk in a gentle glow, highlighting the folds of his saffron robes. The sky behind him is a soft gradient of pink and orange, creating a tranquil backdrop. The camera slowly zooms in, capturing the monk's peaceful expression and the intricate details of the rocks. The scene is bathed in a soft, ethereal light, emphasizing the spiritual atmosphere." \
  --cond_position "0" \
  --noise_multipliers "0" \
  --num_inference_steps 4 \
  --high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
  --high_lora_alpha 1.5 \
  --low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
  --low_lora_alpha 1.4 \
  --cfg_scale 1 \
  --lightx2v
```

**Example 2: Wan2.2 Video Extension with LightX2V**

```shell
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_v2v_pusa.py \
  --video_path "./demos/input_video.mp4" \
  --prompt "piggy bank surfing a tube in teahupo'o wave dusk light cinematic shot shot in 35mm film" \
  --cond_position "0,1,2,3" \
  --noise_multipliers "0.2,0.4,0.4,0.4" \
  --num_inference_steps 4 \
  --high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
  --high_lora_alpha 1.5 \
  --low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
  --low_lora_alpha 1.4 \
  --cfg_scale 1 \
  --lightx2v
```

**Example 3: Wan2.2 Start-End Frames with LightX2V**

```shell
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
  --image_paths "./demos/start_frame.jpg" "./demos/end_frame.jpg" \
  --prompt "plastic injection machine opens releasing a soft inflatable foamy morphing sticky figure over a hand. isometric. low light. dramatic light. macro shot. real footage" \
  --cond_position "0,20" \
  --noise_multipliers "0.2,0.5" \
  --num_inference_steps 4 \
  --high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
  --high_lora_alpha 1.5 \
  --low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
  --low_lora_alpha 1.4 \
  --cfg_scale 1 \
  --lightx2v
```

**Example 4: Wan2.2 Text-to-Video with LightX2V**

```shell
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_text_to_video_pusa.py \
  --prompt "A person is enjoying a meal of spaghetti with a fork in a cozy, dimly lit Italian restaurant. The person has warm, friendly features and is dressed casually but stylishly in jeans and a colorful sweater. They are sitting at a small, round table, leaning slightly forward as they eat with enthusiasm. The spaghetti is piled high on their plate, with some strands hanging over the edge. The background shows soft lighting from nearby candles and a few other diners in the corner, creating a warm and inviting atmosphere. The scene captures a close-up view of the person’s face and hands as they take a bite of spaghetti, with subtle movements of their mouth and fork. The overall style is realistic with a touch of warmth and authenticity, reflecting the comfort of a genuine dining experience." \
  --high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
  --high_lora_alpha 1.5 \
  --low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
  --low_lora_alpha 1.4 \
  --num_inference_steps 4 \
  --cfg_scale 1 \
  --lightx2v
```

### Key Parameters for Wan2.2

- **`--high_lora_path`**: Path to the high-noise DiT LoRA checkpoint
- **`--low_lora_path`**: Path to the low-noise DiT LoRA checkpoint
- **`--high_lora_alpha`**: LoRA alpha for the high-noise model (recommended: 1.5)
- **`--low_lora_alpha`**: LoRA alpha for the low-noise model (recommended: 1.4)
- **`--lightx2v`**: Enable LightX2V acceleration
- **`--cfg_scale`**: Use 1.0 for LightX2V, 3.0 for standard inference

## Related Work

- [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with a vectorized timestep approach that inspired Pusa
- [Wan2.2-T2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B): The advanced dual-DiT base model for this version
- [LightX2V](https://github.com/ModelTC/LightX2V): Acceleration technique for fast inference
- [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): Optimized LoRA implementation for efficient training

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@article{liu2025pusa,
  title={PUSA V1.0: Surpassing Wan-I2V with \$500 Training Cost by Vectorized Timestep Adaptation},
  author={Liu, Yaofang and Ren, Yumeng and Artola, Aitor and Hu, Yuxuan and Cun, Xiaodong and Zhao, Xiaotong and Zhao, Alan and Chan, Raymond H and Zhang, Suiyun and Liu, Rui and others},
  journal={arXiv preprint arXiv:2507.16116},
  year={2025}
}
```

```bibtex
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-Michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```