---
license: mit
pipeline_tag: image-to-image
---
<h2 align="center"> <a href="https://arxiv.org/abs/2503.14325">LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models</a></h2>
https://github.com/user-attachments/assets/a2a4814a-192b-4cc4-b1a3-d612caa1d872
We present **LeanVAE**, a lightweight Video VAE designed for ultra-efficient video compression and scalable generation in Latent Video Diffusion Models (LVDMs).
- **Lightweight & Efficient**: Only **40M parameters**, significantly reducing computational overhead
- **Optimized for High-Resolution Videos**: Encodes and decodes a **17-frame 1080p video** in **3 seconds** using only **15 GB of GPU memory** *(without tiling inference)*
- **State-of-the-Art Video Reconstruction**: Competes with leading Video VAEs
- **Versatile**: Supports both **images and videos**, preserving **causality in the latent space**
- **Validated in Diffusion Models**: Improves visual quality in video generation
---
## **Installation**
Clone the repository and install dependencies:
```bash
git clone https://github.com/westlake-repl/LeanVAE
cd LeanVAE
pip install -r requirements.txt
```
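A quick sanity check that the dependencies installed correctly (run from the repo root; assumes the package imports as in the usage section below):
```bash
python -c "from LeanVAE import LeanVAE; print('LeanVAE import OK')"
```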
---
## **Quick Start**
**Train LeanVAE**
```bash
bash scripts/train.sh
```
**Run Video Reconstruction**
```bash
bash scripts/inference.sh
```
**Evaluate Reconstruction Quality**
```bash
bash scripts/eval.sh
```
---
## **Pretrained Models**
### Video VAE Model:
| Model | PSNR ↑ | LPIPS ↓ | Params | TFLOPs | Checkpoint |
| ---------------- | ------ | ------- | -------- | -------- | ----------------------------------- |
| **LeanVAE-4ch**  | 26.04 | 0.0899 | 39.8M | 0.203 | [LeanVAE-dim4.ckpt](https://huggingface.co/Yumic/LeanVAE/resolve/main/LeanVAE-dim4.ckpt?download=true) |
| **LeanVAE-16ch** | 30.15 | 0.0461 | 39.8M | 0.203 | [LeanVAE-dim16.ckpt](https://huggingface.co/Yumic/LeanVAE/resolve/main/LeanVAE-dim16.ckpt?download=true) |
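To fetch a checkpoint programmatically, a minimal sketch with `huggingface_hub` (repo ID and file names taken from the links above):
```python
from huggingface_hub import hf_hub_download

# Downloads to the local HF cache and returns the path, usable as "path/to/ckpt" below
ckpt_path = hf_hub_download(repo_id="Yumic/LeanVAE", filename="LeanVAE-dim4.ckpt")
print(ckpt_path)
```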
### Latte Model:
The code and pretrained weights for video generation will be released soon. Stay tuned!
| Model | Dataset | FVD ↓ | Checkpoint |
| ---------- | ---------- | ---------- | ----------- |
| Latte + LeanVAE-chn4 | SkyTimelapse |49.59 | sky-chn4.ckpt |
| Latte + LeanVAE-chn4 | UCF101 |164.45 | ucf-chn4.ckpt |
| Latte + LeanVAE-chn16 | SkyTimelapse |95.15 | sky-chn16.ckpt |
| Latte + LeanVAE-chn16 | UCF101 |175.33 | ucf-chn16.ckpt |
---
## **Using LeanVAE in Your Project**
```python
from LeanVAE import LeanVAE

# Load the pretrained model
model = LeanVAE.load_from_checkpoint("path/to/ckpt", strict=False)

# Encode & decode an image
image, image_rec = model.inference(image)

# Encode an image → get its latent
latent = model.encode(image)  # (B, C, H, W) → (B, d, 1, H/8, W/8), where d = 4 or 16

# Decode a latent → reconstruct the image
image = model.decode(latent, is_image=True)  # (B, d, 1, H/8, W/8) → (B, C, H, W)

# Encode & decode a video (frame count must be 4n+1, e.g., 5, 9, 13, 17, ...)
video, video_rec = model.inference(video)

# Encode a video → get its latent
latent = model.encode(video)  # (B, C, T+1, H, W) → (B, d, T/4+1, H/8, W/8), where d = 4 or 16

# Decode a latent → reconstruct the video
video = model.decode(latent)  # (B, d, T/4+1, H/8, W/8) → (B, C, T+1, H, W)

# Enable temporal tiling inference for long videos
model.set_tile_inference(True)
model.chunksize_enc = 5
model.chunksize_dec = 5
```
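Putting it together, here is a rough end-to-end reconstruction sketch. The `[-1, 1]` normalization and the use of `torchvision.io.read_video` are assumptions for illustration; see `scripts/inference.sh` for the exact preprocessing.
```python
import torch
from torchvision.io import read_video

from LeanVAE import LeanVAE

model = LeanVAE.load_from_checkpoint("path/to/ckpt", strict=False).eval().cuda()

# Read frames as (T, H, W, C) uint8 and rearrange to (B, C, T, H, W)
frames, _, _ = read_video("input.mp4", pts_unit="sec")
frames = frames[:17]  # keep 4n+1 frames (here n = 4)
video = frames.permute(3, 0, 1, 2).unsqueeze(0).float()
video = video / 127.5 - 1.0  # assumed [-1, 1] input range; check the repo's dataloader

with torch.no_grad():
    latent = model.encode(video.cuda())  # (1, d, 5, H/8, W/8) for 17 input frames
    recon = model.decode(latent)         # (1, 3, 17, H, W)
```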
---
## **Preparing Data for Training**
To train LeanVAE, create metadata files that list your video paths grouped by resolution: each text file contains the paths of all videos at one resolution.
```
data_list
├── 96x128.txt        # contains paths to all 96×128 videos
│   ├── /path/to/video_1.mp4
│   ├── /path/to/video_2.mp4
│   └── ...
├── 256x256.txt       # contains paths to all 256×256 videos
│   ├── /path/to/video_3.mp4
│   ├── /path/to/video_4.mp4
│   └── ...
└── 352x288.txt       # contains paths to all 352×288 videos
    ├── /path/to/video_5.mp4
    ├── /path/to/video_6.mp4
    └── ...
```
Set `args.train_datalist` to the folder containing these files.
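If your videos sit in a single folder, a small helper along these lines can generate the metadata files. The `<height>x<width>` file naming is an assumption inferred from the layout above; double-check it against the repo's dataloader.
```python
import os
from collections import defaultdict

import cv2  # opencv-python

def build_datalists(video_dir: str, out_dir: str) -> None:
    """Group video paths by resolution and write one <H>x<W>.txt per group."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(video_dir)):
        if not name.lower().endswith((".mp4", ".avi", ".mov")):
            continue
        path = os.path.abspath(os.path.join(video_dir, name))
        cap = cv2.VideoCapture(path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        cap.release()
        groups[f"{height}x{width}"].append(path)
    os.makedirs(out_dir, exist_ok=True)
    for resolution, paths in groups.items():
        with open(os.path.join(out_dir, f"{resolution}.txt"), "w") as f:
            f.write("\n".join(paths) + "\n")

build_datalists("videos/", "data_list/")  # then set args.train_datalist to "data_list/"
```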
---
## **License**
This project is released under the **MIT License**. See the `LICENSE` file for details.
## **Why Choose LeanVAE?**
LeanVAE is **fast, lightweight, and powerful**, enabling high-quality video compression and generation at minimal computational cost.
If you find this work useful, please consider **starring the repository** and citing our paper!
---
## **Cite Us**
```bibtex
@misc{cheng2025leanvaeultraefficientreconstructionvae,
title={LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models},
author={Yu Cheng and Fajie Yuan},
year={2025},
eprint={2503.14325},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.14325},
}
```
---
## **Acknowledgement**
Our work benefits from the contributions of several open-source projects, including [OmniTokenizer](https://github.com/FoundationVision/OmniTokenizer), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [VidTok](https://github.com/microsoft/VidTok), and [Latte](https://github.com/Vchitect/Latte). We sincerely appreciate their efforts in advancing research and open-source collaboration!