---
license: mit
pipeline_tag: image-to-image
---

<h2 align="center"> <a href="https://arxiv.org/abs/2503.14325">LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models</a></h2>

https://github.com/user-attachments/assets/a2a4814a-192b-4cc4-b1a3-d612caa1d872

We present **LeanVAE**, a lightweight Video VAE designed for ultra-efficient video compression and scalable generation in Latent Video Diffusion Models (LVDMs).

- **Lightweight & Efficient**: Only **40M parameters**, significantly reducing computational overhead 📉
- **Optimized for High-Resolution Videos**: Encodes and decodes a **17-frame 1080p video** in **3 seconds** using only **15GB of GPU memory** *(without tiling inference)* 🎯
- **State-of-the-Art Video Reconstruction**: Competes with leading Video VAEs 🏆
- **Versatile**: Supports both **images and videos**, preserving **causality in latent space** 📽️
- **Validated in Diffusion Models**: Enhances visual quality in video generation ✨

---
## 🛠️ **Installation**
Clone the repository and install dependencies:
```bash
git clone https://github.com/westlake-repl/LeanVAE
cd LeanVAE
pip install -r requirements.txt
```
---
## 🎯 **Quick Start**
**Train LeanVAE**
```bash
bash scripts/train.sh
```

**Run Video Reconstruction**
```bash
bash scripts/inference.sh
```

**Evaluate Reconstruction Quality**
```bash
bash scripts/eval.sh
```
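
For reference, `eval.sh` reports the PSNR and LPIPS metrics used in the table below. A minimal sketch of the PSNR computation, assuming inputs are normalized to `[-1, 1]` (the range actually used by the script is an assumption):

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, data_range: float = 2.0) -> torch.Tensor:
    # Peak signal-to-noise ratio; data_range=2.0 assumes tensors in [-1, 1].
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(data_range ** 2 / mse)
```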
---

## 📜 **Pretrained Models**
### Video VAE Model:
| Model            | PSNR ⬆️ | LPIPS ⬇️ | Params 📦 | TFLOPs ⚡ | Checkpoint 📥                        |
| ---------------- | ------ | ------- | --------- | -------- | ------------------------------------ |
| **LeanVAE-4ch**  | 26.04  | 0.0899  | 39.8M     | 0.203    | [LeanVAE-chn4.ckpt](https://huggingface.co/Yumic/LeanVAE/resolve/main/LeanVAE-dim4.ckpt?download=true) |
| **LeanVAE-16ch** | 30.15  | 0.0461  | 39.8M     | 0.203    | [LeanVAE-chn16.ckpt](https://huggingface.co/Yumic/LeanVAE/resolve/main/LeanVAE-dim16.ckpt?download=true) |
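
To fetch a checkpoint programmatically, a minimal sketch using `huggingface_hub` (an extra dependency here; the loading call follows the usage section below):

```python
from huggingface_hub import hf_hub_download

from LeanVAE import LeanVAE

# Download the 16-channel checkpoint (filenames as in the table above).
ckpt_path = hf_hub_download(repo_id="Yumic/LeanVAE", filename="LeanVAE-dim16.ckpt")
model = LeanVAE.load_from_checkpoint(ckpt_path, strict=False).eval()
```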

 
### Latte Model:
The code and pretrained weights for video generation will be released soon. Stay tuned!
| Model                 | Dataset      | FVD ⬇️ | Checkpoint 📥  |
| --------------------- | ------------ | ------ | -------------- |
| Latte + LeanVAE-chn4  | SkyTimelapse | 49.59  | sky-chn4.ckpt  |
| Latte + LeanVAE-chn4  | UCF101       | 164.45 | ucf-chn4.ckpt  |
| Latte + LeanVAE-chn16 | SkyTimelapse | 95.15  | sky-chn16.ckpt |
| Latte + LeanVAE-chn16 | UCF101       | 175.33 | ucf-chn16.ckpt |

---
## 🔧 **Using LeanVAE in Your Project**

```python
from LeanVAE import LeanVAE

# Load pretrained model
model = LeanVAE.load_from_checkpoint("path/to/ckpt", strict=False)

# 🔄 Encode & decode an image
image, image_rec = model.inference(image)

# 🖼️ Encode an image → latent
latent = model.encode(image)  # (B, C, H, W) → (B, d, 1, H/8, W/8), where d = 4 or 16

# 🖼️ Decode a latent → reconstructed image
image = model.decode(latent, is_image=True)  # (B, d, 1, H/8, W/8) → (B, C, H, W)

# 🔄 Encode & decode a video
video, video_rec = model.inference(video)  # Frame count must be 4n+1 (e.g., 5, 9, 13, 17...)

# 🎞️ Encode a video → latent
latent = model.encode(video)  # (B, C, T+1, H, W) → (B, d, T/4+1, H/8, W/8), where d = 4 or 16

# 🎞️ Decode a latent → reconstructed video
video = model.decode(latent)  # (B, d, T/4+1, H/8, W/8) → (B, C, T+1, H, W)

# ⚡ Enable temporal tiling inference for long videos
model.set_tile_inference(True)
model.chunksize_enc = 5
model.chunksize_dec = 5
```
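
As a quick sanity check, a hypothetical end-to-end run continuing from the snippet above with random data (shapes follow the comments above; the expected input value range is an assumption, not documented here):

```python
import torch

video = torch.randn(1, 3, 17, 256, 256)  # B=1, C=3, 17 frames (4n+1 with n=4)
latent = model.encode(video)             # -> (1, d, 5, 32, 32), d = 4 or 16
recon = model.decode(latent)             # -> (1, 3, 17, 256, 256)
assert recon.shape == video.shape
```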
---

## 📂 **Preparing Data for Training**
To train LeanVAE, create metadata files that list video paths grouped by resolution: each text file contains the paths of all videos at one resolution.
```
📂 data_list
 ├── 📄 96x128.txt  📜  # Paths to all 96×128 videos
 │   ├── /path/to/video_1.mp4
 │   ├── /path/to/video_2.mp4
 │   ├── ...
 ├── 📄 256x256.txt  📜  # Paths to all 256×256 videos
 │   ├── /path/to/video_3.mp4
 │   ├── /path/to/video_4.mp4
 │   ├── ...
 ├── 📄 352x288.txt  📜  # Paths to all 352×288 videos
 │   ├── /path/to/video_5.mp4
 │   ├── /path/to/video_6.mp4
 │   ├── ...
```
📌 Each text file lists the video paths for one resolution. Set `args.train_datalist` to the folder containing these files. A sketch of how such files might be generated is shown below.
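
A hypothetical helper (not part of the repo) that builds one `<height>x<width>.txt` file per resolution using OpenCV; whether the filenames encode height×width or width×height is an assumption to verify against your data:

```python
import os
from collections import defaultdict
from pathlib import Path

import cv2  # pip install opencv-python

def build_datalists(video_dir: str, out_dir: str = "data_list") -> None:
    """Group .mp4 files by frame size and write one path list per resolution."""
    groups = defaultdict(list)
    for path in Path(video_dir).rglob("*.mp4"):
        cap = cv2.VideoCapture(str(path))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        cap.release()
        groups[f"{h}x{w}"].append(str(path.resolve()))
    os.makedirs(out_dir, exist_ok=True)
    for res, paths in groups.items():
        Path(out_dir, f"{res}.txt").write_text("\n".join(paths) + "\n")

build_datalists("/path/to/videos")
```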


---
## 📜 **License**

This project is released under the **MIT License**. See the `LICENSE` file for details.


## 🔥 **Why Choose LeanVAE?**
LeanVAE is **fast, lightweight, and powerful**, enabling high-quality video compression and generation with minimal computational cost.

If you find this work useful, consider **starring ⭐ the repository** and citing our paper!

---

## 📝 **Cite Us**
```bibtex
@misc{cheng2025leanvaeultraefficientreconstructionvae,
      title={LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models}, 
      author={Yu Cheng and Fajie Yuan},
      year={2025},
      eprint={2503.14325},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.14325}, 
}
```
---

## 👏 **Acknowledgement**
Our work benefits from the contributions of several open-source projects, including [OmniTokenizer](https://github.com/FoundationVision/OmniTokenizer), [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), [VidTok](https://github.com/microsoft/VidTok), and [Latte](https://github.com/Vchitect/Latte). We sincerely appreciate their efforts in advancing research and open-source collaboration!