EQ-SDXL-VAE: open-source reproduction of EQ-VAE on SDXL-VAE
Training is still in progress; there may be more updates in the future.
original paper: https://arxiv.org/abs/2502.09509
source code of the reproduction: https://github.com/KohakuBlueleaf/HakuLatent
Introduction
EQ-VAE, short for Equivariance Regularized VAE, is a novel technique introduced in the paper "Equivariance Regularized Latent Space for Improved Generative Image Modeling" to enhance the latent spaces of autoencoders used in generative image models. The core idea behind EQ-VAE is to address a critical limitation in standard autoencoders: their lack of equivariance to semantic-preserving transformations like scaling and rotation. This non-equivariance results in unnecessarily complex latent spaces, making it harder for subsequent generative models (like diffusion models) to learn efficiently and achieve optimal performance.
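To make the core idea concrete: an equivariant latent space means that decoding a transformed latent should match the same transformation applied to the input image. The following is only an illustrative PyTorch sketch of that constraint (using a 90° rotation as the example transformation), not the exact objective used in the paper or in HakuLatent:

```python
import torch
import torch.nn.functional as F

def equivariance_loss(vae, x):
    """Penalize non-equivariance: decode(rotate(encode(x))) should look like rotate(x)."""
    z = vae.encode(x).latent_dist.sample()        # image (B, 3, H, W) -> latent (B, C, h, w)

    z_rot = torch.rot90(z, k=1, dims=(2, 3))      # rotate the latent spatially
    x_rot = torch.rot90(x, k=1, dims=(2, 3))      # rotate the input image the same way

    x_rec = vae.decode(z_rot).sample              # decode the transformed latent
    return F.mse_loss(x_rec, x_rot)               # mismatch = lack of equivariance
```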
This repository provides the model weights of the open-source reproduction of the EQ-VAE method, specifically applied to SDXL-VAE. SDXL-VAE is a powerful variational autoencoder known for its use in the popular Stable Diffusion XL (SDXL) image generation models. By fine-tuning the pre-trained SDXL-VAE with the EQ-VAE regularization, we aim to create a more structured and semantically meaningful latent space. This should lead to benefits such as:
- Improved Generative Performance: A simpler, more equivariant latent space is expected to be easier for generative models to learn from, potentially leading to faster training and improved image quality metrics like FID.
- Enhanced Latent Space Structure: EQ-VAE encourages the latent representations to respect spatial transformations, resulting in a smoother and more interpretable latent manifold.
- Compatibility with Existing Models: EQ-VAE is designed as a regularization technique that can be applied to pre-trained autoencoders without requiring architectural changes or training from scratch, making it a practical and versatile enhancement.
This reproduction allows you to experiment with EQ-VAE on SDXL-VAE, replicate the findings of the original paper, and potentially leverage the benefits of equivariance regularization in your own generative modeling projects. For a deeper understanding of the theoretical background and experimental results, please refer to the original EQ-VAE paper linked above. The source code in the HakuLatent repository provides a straightforward implementation of the EQ-VAE fine-tuning process for any diffusers VAE model.
Visual Examples
Left: original image, Center: latent PCA to 3 dims as RGB, Right: decoded image.
Top: original VAE, Bottom: EQ-VAE fine-tuned VAE.
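For reference, a latent-PCA visualization like the one described above can be produced roughly as follows. This is a minimal sketch of the usual approach (project the latent channels onto their top three principal components and display them as RGB), not necessarily the exact script used for the figures:

```python
import torch

def latent_pca_rgb(latent: torch.Tensor) -> torch.Tensor:
    """Project a (C, H, W) latent onto 3 principal components for RGB display."""
    c, h, w = latent.shape
    flat = latent.reshape(c, -1).T                      # (H*W, C): each pixel is a sample
    flat = flat - flat.mean(dim=0, keepdim=True)        # center the channel values
    _, _, v = torch.pca_lowrank(flat, q=3)              # top-3 principal directions, (C, 3)
    rgb = (flat @ v).T.reshape(3, h, w)                 # project and reshape to 3 channels
    lo = rgb.amin(dim=(1, 2), keepdim=True)
    hi = rgb.amax(dim=(1, 2), keepdim=True)
    return (rgb - lo) / (hi - lo + 1e-8)                # normalize each channel to [0, 1]
```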
Usage
This model is heavily fine-tuned from SDXL-VAE and introduces a completely new latent space. YOU CAN'T USE THIS ON YOUR SDXL MODEL.
You can try to use this VAE to fine-tune your SDXL model and expect a better final result, but it may require a lot of time to achieve it...
To use this model in your own code or setup, load it with the AutoencoderKL class from the diffusers library:
```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("KBlueLeaf/EQ-SDXL-VAE").cuda().half()
...
```
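A rough sketch of a full encode/decode round trip (the preprocessing, image size, and dtype handling here are illustrative assumptions, not a fixed recipe):

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor

vae = AutoencoderKL.from_pretrained("KBlueLeaf/EQ-SDXL-VAE").cuda().half()

image = load_image("input.png").resize((256, 256))          # any RGB image
x = to_tensor(image).unsqueeze(0).cuda().half() * 2 - 1     # (1, 3, 256, 256) in [-1, 1]

with torch.no_grad():
    latent = vae.encode(x).latent_dist.sample()   # (1, 4, 32, 32), same layout as SDXL-VAE
    recon = vae.decode(latent).sample             # reconstructed image in [-1, 1]
```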
Training Setup
- Base Model: SDXL-VAE-fp16-fix
- Dataset: ImageNet-1k-resized-256
- Batch Size: 128 (batch size 8, gradient accumulation 16)
- Samples Seen: 3.4M (26,500 optimizer steps on the VAE)
- Discriminator: HakuNLayerDiscriminator with n_layer=4
- Discriminator startup step: 10000
- Reconstruction Loss:
  - MSE loss
  - LPIPS loss
  - ConvNeXt perceptual loss
- Loss weights (combined as sketched below):
  - recon loss: 1.0
  - adv (disc) loss: 0.5
  - KL div loss: 1e-7
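As an illustration of how these weights combine into the training objective (assuming the reconstruction terms are simply summed; the exact weighting lives in the HakuLatent configs):

```python
def training_loss(mse, lpips, convnext, adv, kl):
    """Hypothetical combination of the loss terms using the weights listed above."""
    recon = mse + lpips + convnext          # assumed: reconstruction terms summed equally
    return 1.0 * recon + 0.5 * adv + 1e-7 * kl
```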
Evaluation Results
We use the validation split of ImageNet at 256x256 resolution, with MSE loss, PSNR, LPIPS, and ConvNeXt perceptual loss as our metrics.
| Metrics | SDXL-VAE | EQ-SDXL-VAE |
|---|---|---|
| MSE Loss | 3.681e-3 | 3.720e-3 |
| PSNR | 24.6602 | 24.5649 |
| LPIPS | 0.1314 | 0.1407 |
| ConvNeXt | 1.303e-03 | 1.546e-03 |
Based on the results of the original paper, this slight degradation in reconstruction performance is expected: equivariance regularization is still a form of regularization, so the model needs extra capacity to reach the same reconstruction quality while maintaining the equivariance property.
If we take the quality of the latent space into account, EQ-SDXL-VAE clearly surpasses the original SDXL-VAE.
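For reference, the pixel-level metrics above can be computed per batch as follows (a minimal sketch assuming images scaled to [0, 1]; LPIPS and the ConvNeXt perceptual metric need their respective pretrained networks and are omitted here):

```python
import torch
import torch.nn.functional as F

def mse_and_psnr(original: torch.Tensor, reconstructed: torch.Tensor):
    """MSE and PSNR for image batches with values in [0, 1]."""
    mse = F.mse_loss(reconstructed, original)
    psnr = 10 * torch.log10(1.0 / mse)      # PSNR = 10 * log10(MAX^2 / MSE), with MAX = 1.0
    return mse.item(), psnr.item()
```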
Next step
After the training is done, I will try to train a small T2I model on it to check whether EQ-VAE does help the training of image generation models.
Also, I will try to train a simple approximation decoder with only 2x upscale or no upscale of the latent, for fast previews (if needed).
References
[1] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling. arXiv:2502.09509. https://arxiv.org/abs/2502.09509
[2] madebyollin/sdxl-vae-fp16-fix · Hugging Face. https://huggingface.co/madebyollin/sdxl-vae-fp16-fix
[3] evanarlian/imagenet_1k_resized_256 · Datasets at Hugging Face. https://huggingface.co/datasets/evanarlian/imagenet_1k_resized_256
Cite
@misc{kohakublueleaf_eq_sdxl_vae,
author = {Shih-Ying Yeh (KohakuBlueLeaf)},
title = {EQ-SDXL-VAE: Equivariance Regularized SDXL Variational Autoencoder},
year = {2025},
howpublished = {Hugging Face model card},
url = {https://huggingface.co/KBlueLeaf/EQ-SDXL-VAE},
note = {Finetuned SDXL-VAE with EQ-VAE regularization for improved latent space equivariance.}
}
Acknowledgement
- xiaoqianWX: Provided the compute resources.