Pusa Wan2.2 V1.0 Model
Code Repository | Project Page | Dataset | Wan2.1 Model | Paper (Pusa V1.0) | Paper (FVDM) | Follow on X | Xiaohongshu
Pusa Wan2.2 V1.0 extends the groundbreaking Pusa paradigm to the advanced Wan2.2-T2V-A14B architecture, which features a Mixture-of-Experts (MoE) DiT design with separate high-noise and low-noise models. This architecture improves quality control and generation capability while retaining the vectorized timestep adaptation (VTA) approach.
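The gist of the two ideas, as a minimal Python sketch (the boundary value and shapes below are illustrative assumptions, not the shipped configuration):

```python
import torch

# Vectorized timestep adaptation (VTA): instead of one scalar timestep for the
# whole clip, every latent frame carries its own noise level, so conditioning
# frames can stay (nearly) clean while the rest are denoised from scratch.
num_latent_frames = 21
t = torch.full((num_latent_frames,), 0.7)  # frames being generated
t[0] = 0.0                                 # a clean conditioning frame

# Wan2.2's MoE DiT: two experts, one specialized for the high-noise part of
# the trajectory and one for the low-noise part; each step is routed by its
# noise level.
BOUNDARY = 0.9  # hypothetical switch point, not the real Wan2.2 constant

def pick_expert(step_noise_level: float) -> str:
    return "high_noise_model" if step_noise_level >= BOUNDARY else "low_noise_model"

print(pick_expert(float(t.max())))  # -> low_noise_model for this step
```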
One model covers many tasks, all supporting 4-step inference with LightX2V: Image-to-Video, Start-End Frames, Video Completion, Video Extension, Text-to-Video, Video Transition, and more...
Example 1: Image-to-Video in 4 Steps
- noise: 0.2, high_lora_alpha: 1.5
- noise: 0.3, high_lora_alpha: 1.4
- noise: 0.2, high_lora_alpha: 1.5
- noise: 0.2, high_lora_alpha: 1.5
Example 2: Video Extension in 4 Steps
- noise: [0.0, 0.3, 0.5, 0.7], high_lora_alpha: 1.5
- noise: [0.2, 0.4, 0.4, 0.4], high_lora_alpha: 1.4
Example 3: Start-End Frames in 4 Steps
- noise: [0.2, 0.5], high_lora_alpha: 1.5
- noise: [0.0, 0.4], high_lora_alpha: 1.5
Example 4: Text-to-Video in 4 Steps
Installation and Usage
Download Weights and Setup
Option 1: Use the Hugging Face CLI:
# Make sure you are in the PusaV1 directory
# Install huggingface-cli if you don't have it
pip install -U "huggingface_hub[cli]"
huggingface-cli download RaphaelLiu/Pusa-Wan2.2-V1 --local-dir ./model_zoo/PusaV1/Wan2.2-Models
# Download base Wan2.2 models if you don't have them
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./model_zoo/PusaV1/Wan2.2-T2V-A14B
Option 2: Download the LoRA checkpoints directly from this Hugging Face repository.
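If you prefer scripting the setup, the same downloads can be done with the huggingface_hub Python API:

```python
from huggingface_hub import snapshot_download

# Pusa Wan2.2 LoRA checkpoints
snapshot_download(repo_id="RaphaelLiu/Pusa-Wan2.2-V1",
                  local_dir="./model_zoo/PusaV1/Wan2.2-Models")
# Base Wan2.2 models, if you don't have them already
snapshot_download(repo_id="Wan-AI/Wan2.2-T2V-A14B",
                  local_dir="./model_zoo/PusaV1/Wan2.2-T2V-A14B")
```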
Usage Examples
Use these LoRA checkpoints with the Pusa codebase.
Wan2.2 w/ ⚡ LightX2V Acceleration
LightX2V provides ultra-fast 4-step inference while maintaining generation quality. Compatible with both Wan2.1 and Wan2.2 models.
Key Parameters for LightX2V:
- --lightx2v: Enable LightX2V acceleration
- --cfg_scale 1: Critical; must be set to 1 for LightX2V
- --num_inference_steps 4: Use 4 steps instead of 30
- --high_lora_alpha 1.5, --low_lora_alpha 1.4: Recommended values for LightX2V (a larger alpha gives smaller motion); high_lora_alpha has the bigger impact on the output (see the sketch below)
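To make the alpha knob concrete, here is a minimal sketch under the common LoRA convention, where alpha scales the low-rank update added to each frozen base weight (the actual scaling in the Pusa/DiffSynth code may also fold in the rank):

```python
import torch

d, r = 64, 8                                 # toy sizes; real ranks differ
W = torch.randn(d, d)                        # frozen base weight
A, B = torch.randn(r, d), torch.randn(d, r)  # trained LoRA factors

def merged_weight(alpha: float) -> torch.Tensor:
    # Larger alpha -> the Pusa adaptation overrides the base model more
    # strongly, which is also why it damps motion more.
    return W + alpha * (B @ A)
```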
Example 1: Wan2.2 Image-to-Video with LightX2V
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
--image_paths "./demos/input_image.jpg" \
--prompt "A wide-angle shot shows a serene monk meditating perched a top of the letter E of a pile of weathered rocks that vertically spell out 'ZEN'. The rock formation is perched atop a misty mountain peak at sunrise. The warm light bathes the monk in a gentle glow, highlighting the folds of his saffron robes. The sky behind him is a soft gradient of pink and orange, creating a tranquil backdrop. The camera slowly zooms in, capturing the monk's peaceful expression and the intricate details of the rocks. The scene is bathed in a soft, ethereal light, emphasizing the spiritual atmosphere." \
--cond_position "0" \
--noise_multipliers "0" \
--num_inference_steps 4 \
--high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
--high_lora_alpha 1.5 \
--low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
--low_lora_alpha 1.4 \
--cfg_scale 1 \
--lightx2v
Example 2: Wan2.2 Video Extension with LightX2V
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_v2v_pusa.py \
--video_path "./demos/input_video.mp4" \
--prompt "piggy bank surfing a tube in teahupo'o wave dusk light cinematic shot shot in 35mm film" \
--cond_position "0,1,2,3" \
--noise_multipliers "0.2,0.4,0.4,0.4" \
--num_inference_steps 4 \
--high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
--high_lora_alpha 1.5 \
--low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
--low_lora_alpha 1.4 \
--cfg_scale 1 \
--lightx2v
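A hedged sketch of how the two flags above pair up, assuming a rectified-flow-style mix of the clean conditioning latents and Gaussian noise (the exact rule lives in wan22_14b_v2v_pusa.py):

```python
import torch

cond_position = [0, 1, 2, 3]              # latent frames taken from the input video
noise_multipliers = [0.2, 0.4, 0.4, 0.4]  # per-frame noise level in [0, 1]

latents = torch.randn(21, 16, 60, 104)    # assumed (frames, channels, h, w) shape
clean = torch.randn(21, 16, 60, 104)      # stand-in for VAE-encoded video latents

for idx, m in zip(cond_position, noise_multipliers):
    # m = 0 pins the frame to the input exactly; a larger m lets the model
    # deviate from it, which helps the extension blend with new content.
    latents[idx] = (1 - m) * clean[idx] + m * torch.randn_like(clean[idx])
```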
Example 3: Wan2.2 Start-End Frames with LightX2V
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
--image_paths "./demos/start_frame.jpg" "./demos/end_frame.jpg" \
--prompt "plastic injection machine opens releasing a soft inflatable foamy morphing sticky figure over a hand. isometric. low light. dramatic light. macro shot. real footage" \
--cond_position "0,20" \
--noise_multipliers "0.2,0.5" \
--num_inference_steps 4 \
--high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
--high_lora_alpha 1.5 \
--low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
--low_lora_alpha 1.4 \
--cfg_scale 1 \
--lightx2v
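Why index 20 for the end frame? Assuming the 4x temporal compression of the Wan VAE (carried over from Wan2.1) and the default 81-frame clip, a quick check:

```python
num_pixel_frames = 81                                # Wan's default clip length
num_latent_frames = 1 + (num_pixel_frames - 1) // 4  # -> 21 latent frames
start_idx, end_idx = 0, num_latent_frames - 1        # -> 0 and 20
print(start_idx, end_idx)  # matches --cond_position "0,20"
```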
Example 4: Wan2.2 Text-to-Video with LightX2V
CUDA_VISIBLE_DEVICES=0 python examples/pusavideo/wan22_14b_text_to_video_pusa.py \
--prompt "A person is enjoying a meal of spaghetti with a fork in a cozy, dimly lit Italian restaurant. The person has warm, friendly features and is dressed casually but stylishly in jeans and a colorful sweater. They are sitting at a small, round table, leaning slightly forward as they eat with enthusiasm. The spaghetti is piled high on their plate, with some strands hanging over the edge. The background shows soft lighting from nearby candles and a few other diners in the corner, creating a warm and inviting atmosphere. The scene captures a close-up view of the person’s face and hands as they take a bite of spaghetti, with subtle movements of their mouth and fork. The overall style is realistic with a touch of warmth and authenticity, reflecting the comfort of a genuine dining experience." \
--high_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/high_noise_pusa.safetensors" \
--high_lora_alpha 1.5 \
--low_lora_path "model_zoo/PusaV1/Pusa-Wan2.2-V1/low_noise_pusa.safetensors" \
--low_lora_alpha 1.4 \
--num_inference_steps 4 \
--cfg_scale 1 \
--lightx2v
Key Parameters for Wan2.2
- --high_lora_path: Path to the high-noise DiT LoRA checkpoint
- --low_lora_path: Path to the low-noise DiT LoRA checkpoint
- --high_lora_alpha: LoRA alpha for the high-noise model (recommended: 1.5)
- --low_lora_alpha: LoRA alpha for the low-noise model (recommended: 1.4)
- --lightx2v: Enable LightX2V acceleration
- --cfg_scale: Use 1.0 for LightX2V, 3.0 for standard inference (see the sketch below)
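The cfg_scale recommendation follows from the classifier-free guidance formula: at scale 1 the unconditional branch cancels out, which is consistent with a distilled 4-step model that has guidance baked in (a sketch, not the pipeline's code):

```python
import torch

def cfg(pred_cond: torch.Tensor, pred_uncond: torch.Tensor, scale: float) -> torch.Tensor:
    # Standard classifier-free guidance combination.
    return pred_uncond + scale * (pred_cond - pred_uncond)

c, u = torch.randn(4), torch.randn(4)
assert torch.allclose(cfg(c, u, 1.0), c)  # scale 1 -> pure conditional output
# scale 3.0 (standard, non-LightX2V inference) amplifies the prompt's influence.
```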
Related Work
- FVDM: Introduces the groundbreaking frame-level noise control via the vectorized timestep approach that inspired Pusa
- Wan2.2-T2V-A14B: The advanced dual DiT base model for this version
- LightX2V: Acceleration technique for fast inference
- DiffSynth-Studio: Optimized LoRA implementation for efficient training
Citation
If you find our work useful in your research, please consider citing:
@article{liu2025pusa,
  title={PUSA V1.0: Surpassing Wan-I2V with \$500 Training Cost by Vectorized Timestep Adaptation},
  author={Liu, Yaofang and Ren, Yumeng and Artola, Aitor and Hu, Yuxuan and Cun, Xiaodong and Zhao, Xiaotong and Zhao, Alan and Chan, Raymond H and Zhang, Suiyun and Liu, Rui and others},
  journal={arXiv preprint arXiv:2507.16116},
  year={2025}
}
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-Michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}