LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

https://github.com/user-attachments/assets/a2a4814a-192b-4cc4-b1a3-d612caa1d872

We present LeanVAE, a lightweight Video VAE designed for ultra-efficient video compression and scalable generation in Latent Video Diffusion Models (LVDMs).

  • Lightweight & Efficient: Only 40M parameters, significantly reducing computational overhead 📉
  • Optimized for High-Resolution Videos: Encodes and decodes a 17-frame 1080p video in 3 seconds using only 15GB of GPU memory (without tiling inference) 🎯
  • State-of-the-Art Video Reconstruction: Competes with leading Video VAEs 🏆
  • Versatile: Supports both images and videos, preserving causality in the latent space 📽️
  • Validated in Diffusion Models: Improves visual quality in video generation ✨

πŸ› οΈ Installation

Clone the repository and install dependencies: git clone https://github.com/westlake-repl/LeanVAE cd LeanVAE pip install -r requirements.txt

🎯 Quick Start

Train LeanVAE

bash scripts/train.sh

Run Video Reconstruction

bash scripts/inference.sh

Evaluate Reconstruction Quality

bash scripts/eval.sh
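For reference, the tables below report PSNR (higher is better) and LPIPS (lower is better). Here is a minimal sketch of the PSNR computation, assuming inputs scaled to [0, 1]; this is illustrative only, not the repo's eval script:

import torch

def psnr(x, x_rec, max_val=1.0):
    # Peak signal-to-noise ratio between a clip and its reconstruction,
    # both assumed to be tensors with values in [0, max_val].
    mse = torch.mean((x - x_rec) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)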

📜 Pretrained Models

Video VAE Model:

| Model | PSNR ⬆️ | LPIPS ⬇️ | Params 📦 | TFLOPs ⚡ | Checkpoint 📥 |
|-------|---------|----------|-----------|-----------|----------------|
| LeanVAE-4ch | 26.04 | 0.0899 | 39.8M | 0.203 | LeanVAE-chn4.ckpt |
| LeanVAE-16ch | 30.15 | 0.0461 | 39.8M | 0.203 | LeanVAE-chn16.ckpt |

Latte Model:

The code and pretrained weights for video generation will be released soon. Stay tuned!

| Model | Dataset | FVD ⬇️ | Checkpoint 📥 |
|-------|---------|--------|----------------|
| Latte + LeanVAE-chn4 | SkyTimelapse | 49.59 | sky-chn4.ckpt |
| Latte + LeanVAE-chn4 | UCF101 | 164.45 | ucf-chn4.ckpt |
| Latte + LeanVAE-chn16 | SkyTimelapse | 95.15 | sky-chn16.ckpt |
| Latte + LeanVAE-chn16 | UCF101 | 175.33 | ucf-chn16.ckpt |

🔧 Using LeanVAE in Your Project

from LeanVAE import LeanVAE

# Load pretrained model
model = LeanVAE.load_from_checkpoint("path/to/ckpt", strict=False)

# 🔄 Encode & Decode an Image
image, image_rec = model.inference(image)

# 🖼️ Encode an image → Get latent
latent = model.encode(image)  # (B, C, H, W) → (B, d, 1, H/8, W/8), where d=4 or 16

# 🖼️ Decode latent representation → Reconstruct image
image = model.decode(latent, is_image=True)  # (B, d, 1, H/8, W/8) → (B, C, H, W)


# 🔄 Encode & Decode a Video
video, video_rec = model.inference(video)  # Frame count must be 4n+1 (e.g., 5, 9, 13, 17...)

# 🎞️ Encode Video → Get Latent Space
latent = model.encode(video)  # (B, C, T+1, H, W) → (B, d, T/4+1, H/8, W/8), where d=4 or 16

# 🎞️ Decode Latent → Reconstruct Video
video = model.decode(latent)  # (B, d, T/4+1, H/8, W/8) → (B, C, T+1, H, W)

# ⚡ Enable temporal tiling inference for long videos
model.set_tile_inference(True)
model.chunksize_enc = 5
model.chunksize_dec = 5
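Putting the pieces together, here is a minimal end-to-end sketch. The shapes and the 4n+1 frame rule come from the comments above; the random dummy tensor, the [-1, 1] value range, and the pad_to_4n_plus_1 helper are our illustrative assumptions, not part of the released API:

import torch
from LeanVAE import LeanVAE

def pad_to_4n_plus_1(x):
    # Hypothetical helper: repeat the last frame of (B, C, T, H, W)
    # until T % 4 == 1, as required by model.inference / model.encode.
    pad = (1 - x.shape[2]) % 4
    if pad:
        x = torch.cat([x, x[:, :, -1:].repeat(1, 1, pad, 1, 1)], dim=2)
    return x

model = LeanVAE.load_from_checkpoint("path/to/ckpt", strict=False).eval()

# Dummy 17-frame clip; we assume inputs are scaled to [-1, 1]
video = pad_to_4n_plus_1(torch.rand(1, 3, 17, 256, 256) * 2 - 1)

with torch.no_grad():
    latent = model.encode(video)   # expected: (1, d, 5, 32, 32), with d = 4 or 16
    recon = model.decode(latent)   # expected: (1, 3, 17, 256, 256)

assert recon.shape == video.shape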

📂 Preparing Data for Training

To train LeanVAE, create metadata files listing video paths, grouped by resolution: each text file contains the paths of all videos at one resolution.

📂 data_list
 ├── 📄 96x128.txt  📜  # Contains paths to all 96x128 videos
 │   ├── /path/to/video_1.mp4
 │   ├── /path/to/video_2.mp4
 │   ├── ...
 ├── 📄 256x256.txt  📜  # Contains paths to all 256x256 videos
 │   ├── /path/to/video_3.mp4
 │   ├── /path/to/video_4.mp4
 │   ├── ...
 ├── 📄 352x288.txt  📜  # Contains paths to all 352x288 videos
 │   ├── /path/to/video_5.mp4
 │   ├── /path/to/video_6.mp4
 │   ├── ...

📌 Each text file lists the video paths for one resolution. Set args.train_datalist to the folder containing these files. A sketch for generating the lists follows below.
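As a convenience, here is a minimal sketch that builds these metadata files by probing each video's resolution with OpenCV. The script is our illustration, not part of the repo; we also assume the file names are WxH (as suggested by 352x288, the CIF resolution), so double-check the convention against your setup:

import os
from collections import defaultdict

import cv2  # assumption: OpenCV is available for reading video metadata

def build_datalists(video_dir, out_dir="data_list"):
    # Walk video_dir, group .mp4 paths by resolution, and write one
    # <W>x<H>.txt file per resolution into out_dir.
    groups = defaultdict(list)
    for root, _, files in os.walk(video_dir):
        for name in files:
            if not name.endswith(".mp4"):
                continue
            path = os.path.join(root, name)
            cap = cv2.VideoCapture(path)
            w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
            h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
            cap.release()
            if w and h:
                groups[f"{w}x{h}"].append(path)
    os.makedirs(out_dir, exist_ok=True)
    for res, paths in groups.items():
        with open(os.path.join(out_dir, f"{res}.txt"), "w") as f:
            f.write("\n".join(paths) + "\n")

build_datalists("/path/to/videos")  # then set args.train_datalist = "data_list"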


📜 License

This project is released under the MIT License. See the LICENSE file for details.

🔥 Why Choose LeanVAE?

LeanVAE is fast, lightweight, and powerful, enabling high-quality video compression and generation at minimal computational cost.

If you find this work useful, consider starring ⭐ the repository and citing our paper!


πŸ“ Cite Us

@misc{cheng2025leanvaeultraefficientreconstructionvae,
      title={LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models}, 
      author={Yu Cheng and Fajie Yuan},
      year={2025},
      eprint={2503.14325},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.14325}, 
}

πŸ‘ Acknowledgement

Our work benefits from the contributions of several open-source projects, including OmniTokenizer, Open-Sora-Plan, VidTok, and Latte. We sincerely appreciate their efforts in advancing research and open-source collaboration!
