EQ-SDXL-VAE: open-source reproduction of EQ-VAE on SDXL-VAE
Training is still in progress; there may be more updates in the future.
original paper: https://arxiv.org/abs/2502.09509
source code of the reproduction: https://github.com/KohakuBlueleaf/HakuLatent
Introduction
EQ-VAE, short for Equivariance Regularized VAE, is a novel technique introduced in the paper "Equivariance Regularized Latent Space for Improved Generative Image Modeling" to enhance the latent spaces of autoencoders used in generative image models. The core idea behind EQ-VAE is to address a critical limitation in standard autoencoders: their lack of equivariance to semantic-preserving transformations like scaling and rotation. This non-equivariance results in unnecessarily complex latent spaces, making it harder for subsequent generative models (like diffusion models) to learn efficiently and achieve optimal performance.
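To make the core idea concrete: an equivariant latent space means that decoding a transformed latent should match the same transformation applied to the input image. The following is only an illustrative PyTorch sketch of that constraint (using a 90° rotation as the example transformation), not the exact objective used in the paper or in HakuLatent:

```python
import torch
import torch.nn.functional as F

def equivariance_loss(vae, x):
    """Penalize non-equivariance: decode(rotate(encode(x))) should look like rotate(x)."""
    z = vae.encode(x).latent_dist.sample()        # image (B, 3, H, W) -> latent (B, C, h, w)

    z_rot = torch.rot90(z, k=1, dims=(2, 3))      # rotate the latent spatially
    x_rot = torch.rot90(x, k=1, dims=(2, 3))      # rotate the input image the same way

    x_rec = vae.decode(z_rot).sample              # decode the transformed latent
    return F.mse_loss(x_rec, x_rot)               # mismatch = lack of equivariance
```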
This repository provides the model weights of the open-source reproduction of the EQ-VAE method, specifically applied to SDXL-VAE. SDXL-VAE is a powerful variational autoencoder known for its use in the popular Stable Diffusion XL (SDXL) image generation models. By fine-tuning the pre-trained SDXL-VAE with the EQ-VAE regularization, we aim to create a more structured and semantically meaningful latent space. This should lead to benefits such as:
- Improved Generative Performance: A simpler, more equivariant latent space is expected to be easier for generative models to learn from, potentially leading to faster training and improved image quality metrics like FID.
- Enhanced Latent Space Structure: EQ-VAE encourages the latent representations to respect spatial transformations, resulting in a smoother and more interpretable latent manifold.
- Compatibility with Existing Models: EQ-VAE is designed as a regularization technique that can be applied to pre-trained autoencoders without requiring architectural changes or training from scratch, making it a practical and versatile enhancement.
This reproduction allows you to experiment with EQ-VAE on SDXL-VAE, replicate the findings of the original paper, and potentially leverage the benefits of equivariance regularization in your own generative modeling projects. For a deeper understanding of the theoretical background and experimental results, please refer to the original EQ-VAE paper linked above. The source code in the HakuLatent repository provides a straightforward implementation of the EQ-VAE fine-tuning process for any diffusers VAE model.
Visual Examples
Left: original image, Center: latent PCA to 3 dims as RGB, Right: decoded image.
Top: original VAE, Bottom: EQ-VAE fine-tuned VAE.
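For reference, a latent-PCA visualization like the one described above can be produced roughly as follows. This is a minimal sketch of the usual approach (project the latent channels onto their top three principal components and display them as RGB), not necessarily the exact script used for the figures:

```python
import torch

def latent_pca_rgb(latent: torch.Tensor) -> torch.Tensor:
    """Project a (C, H, W) latent onto 3 principal components for RGB display."""
    c, h, w = latent.shape
    flat = latent.reshape(c, -1).T                      # (H*W, C): each pixel is a sample
    flat = flat - flat.mean(dim=0, keepdim=True)        # center the channel values
    _, _, v = torch.pca_lowrank(flat, q=3)              # top-3 principal directions, (C, 3)
    rgb = (flat @ v).T.reshape(3, h, w)                 # project and reshape to 3 channels
    lo = rgb.amin(dim=(1, 2), keepdim=True)
    hi = rgb.amax(dim=(1, 2), keepdim=True)
    return (rgb - lo) / (hi - lo + 1e-8)                # normalize each channel to [0, 1]
```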
Usage
This model is heavily fine-tuned from SDXL-VAE and introduces a completely new latent space. YOU CAN'T USE THIS ON YOUR SDXL MODEL.
You can try to use this VAE to fine-tune your SDXL model and expect a better final result, but it may require a lot of time to achieve it...
To use this model in your own code or setup, load it with the AutoencoderKL class from the diffusers library:
```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("KBlueLeaf/EQ-SDXL-VAE").cuda().half()
...
```
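A rough sketch of a full encode/decode round trip (the preprocessing, image size, and dtype handling here are illustrative assumptions, not a fixed recipe):

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor

vae = AutoencoderKL.from_pretrained("KBlueLeaf/EQ-SDXL-VAE").cuda().half()

image = load_image("input.png").resize((256, 256))          # any RGB image
x = to_tensor(image).unsqueeze(0).cuda().half() * 2 - 1     # (1, 3, 256, 256) in [-1, 1]

with torch.no_grad():
    latent = vae.encode(x).latent_dist.sample()   # (1, 4, 32, 32), same layout as SDXL-VAE
    recon = vae.decode(latent).sample             # reconstructed image in [-1, 1]
```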
Training Setup
- Base Model: SDXL-VAE-fp16-fix
- Dataset: ImageNet-1k-resized-256
- Batch Size: 128 (batch size 8, gradient accumulation 16)
- Samples Seen: 3.4M (26,500 optimizer steps on the VAE)
- Discriminator: HakuNLayerDiscriminator with n_layer=4
- Discriminator startup step: 10000
- Reconstruction Loss:
  - MSE loss
  - LPIPS loss
  - ConvNeXt perceptual loss
- Loss weights (combined as sketched below):
  - recon loss: 1.0
  - adv (disc) loss: 0.5
  - KL div loss: 1e-7
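As an illustration of how these weights combine into the training objective (assuming the reconstruction terms are simply summed; the exact weighting lives in the HakuLatent configs):

```python
def training_loss(mse, lpips, convnext, adv, kl):
    """Hypothetical combination of the loss terms using the weights listed above."""
    recon = mse + lpips + convnext          # assumed: reconstruction terms summed equally
    return 1.0 * recon + 0.5 * adv + 1e-7 * kl
```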
Evaluation Results
We use the validation split of ImageNet at 256x256 resolution, with MSE loss, PSNR, LPIPS, and ConvNeXt perceptual loss as our metrics.
| Metrics | SDXL-VAE | EQ-SDXL-VAE |
|---|---|---|
| MSE Loss | 3.681e-3 | 3.720e-3 |
| PSNR | 24.6602 | 24.5649 |
| LPIPS | 0.1314 | 0.1407 |
| ConvNeXt | 1.303e-03 | 1.546e-03 |
Based on the results of the original paper, this slight degradation in reconstruction performance is expected: equivariance regularization is still a form of regularization, so the model needs extra capacity to reach the same reconstruction quality while maintaining the equivariance property.
If we take the quality of the latent space into account, EQ-SDXL-VAE clearly surpasses the original SDXL-VAE.
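For reference, the pixel-level metrics above can be computed per batch as follows (a minimal sketch assuming images scaled to [0, 1]; LPIPS and the ConvNeXt perceptual metric need their respective pretrained networks and are omitted here):

```python
import torch
import torch.nn.functional as F

def mse_and_psnr(original: torch.Tensor, reconstructed: torch.Tensor):
    """MSE and PSNR for image batches with values in [0, 1]."""
    mse = F.mse_loss(reconstructed, original)
    psnr = 10 * torch.log10(1.0 / mse)      # PSNR = 10 * log10(MAX^2 / MSE), with MAX = 1.0
    return mse.item(), psnr.item()
```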
Next step
After the training is done, I will try to train a small T2I model on it to check whether EQ-VAE does help the training of image generation models.
Also, I will try to train a simple approximation decoder with only 2x upscale or no upscale of the latent, for fast previews (if needed).
References
[1] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling. arXiv:2502.09509. https://arxiv.org/abs/2502.09509
[2] madebyollin/sdxl-vae-fp16-fix · Hugging Face. https://huggingface.co/madebyollin/sdxl-vae-fp16-fix
[3] evanarlian/imagenet_1k_resized_256 · Datasets at Hugging Face. https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256
Cite
@misc{kohakublueleaf_eq_sdxl_vae,
author = {Shih-Ying Yeh (KohakuBlueLeaf)},
title = {EQ-SDXL-VAE: Equivariance Regularized SDXL Variational Autoencoder},
year = {2025},
howpublished = {Hugging Face model card},
url = {https://huggingface.co/KBlueLeaf/EQ-SDXL-VAE},
note = {Finetuned SDXL-VAE with EQ-VAE regularization for improved latent space equivariance.}
}
Acknowledgement
- xiaoqianWX: Provided the compute resources.