|
|
--- |
|
|
license: mit |
|
|
pipeline_tag: image-to-image |
|
|
library_name: diffusers |
|
|
--- |
|
|
|
|
|
<h1 align="center"> |
|
|
REPA-E <em>for</em> T2I
|
|
</h1> |
|
|
|
|
|
<p align="center"> |
|
|
<em>End-to-End Tuned VAEs for Supercharging Text-to-Image Diffusion Transformers</em> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://End2End-Diffusion.github.io/repa-e-t2i">π Project Page</a>   |
|
|
<a href="https://huggingface.co/REPA-E/models">π€ Models</a>   |
|
|
<a href="https://arxiv.org/abs/2504.10483">π Paper</a>   |
|
|
<br><br> |
|
|
<!-- <a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a> --> |
|
|
</p> |
|
|
|
|
|
<!-- <p align="center"> |
|
|
<a href="https://scholar.google.com.au/citations?user=GQzvqS4AAAAJ" target="_blank">Xingjian Leng</a><sup>1,2*</sup>   <b>·</b>   |
|
|
<a href="https://1jsingh.github.io/" target="_blank">Jaskirat Singh</a><sup>1</sup>   <b>·</b>   |
|
|
<a href="https://rynmurdock.github.io/" target="_blank">Ryan Murdock</a><sup>2</sup>   <b>·</b>   |
|
|
<a href="https://www.ethansmith2000.com/" target="_blank">Ethan Smith</a><sup>2</sup>   <b>·</b>   |
|
|
<a href="https://xiaoyang-rebecca.github.io/cv/" target="_blank">Rebecca Li</a><sup>2</sup>   <b>·</b>   |
|
|
<a href="https://www.sainingxie.com/" target="_blank">Saining Xie</a><sup>3</sup>  <b>·</b>   |
|
|
<a href="https://zheng-lab-anu.github.io/" target="_blank">Liang Zheng</a><sup>1</sup>  |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<sup>1</sup> Australian National University   <sup>2</sup>Canva   <sup>3</sup>New York University   <br> |
|
|
<sub><sup>*</sup>Done during internship at Canva  </sub> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/abs/2504.10483" target="_blank">π REPA-E Paper</a>   |   |
|
|
<a href="https://end2end-diffusion.github.io/repa-e-t2i/" target="_blank">π Blog Post</a>   |   |
|
|
<a href="https://huggingface.co/REPA-E" target="_blank">π€ Models</a> |
|
|
</p> --> |
|
|
|
|
|
--- |
|
|
|
|
|
## 📝 Overview
|
|
|
|
|
<p> |
|
|
We present REPA-E for T2I, a family of end-to-end tuned VAEs designed to supercharge text-to-image generation training. Text-to-image models trained with these VAEs consistently outperform those trained with the original Qwen-Image-VAE across all evaluated benchmarks (COCO-30K, DPG-Bench, GenAI-Bench, GenEval, and MJHQ-30K), without requiring any additional representation-alignment losses.
|
|
</p> |
|
|
|
|
|
<p> |
|
|
For training, we adopt the <a href="https://github.com/End2End-Diffusion/REPA-E" target="_blank"><strong>official REPA-E training code</strong></a> to optimize the
<a href="https://huggingface.co/Qwen/Qwen-Image" target="_blank">Qwen-Image-VAE</a> for <strong>80 epochs</strong> with a batch size of <strong>256</strong> on the <strong>ImageNet-256</strong> dataset.
This end-to-end tuning refines the VAE's latent-space structure and enables faster convergence in downstream text-to-image latent diffusion model training.
|
|
</p> |
|
|
|
|
|
<p> |
|
|
This repository provides <code>diffusers</code>-compatible weights for the <strong>end-to-end trained Qwen-Image-VAE</strong>. In addition, we release <strong>end-to-end trained variants</strong> of several other widely used VAEs to facilitate research and integration within text-to-image diffusion frameworks. |
|
|
</p> |
|
|
|
|
|
## ⚡️ Quickstart
|
|
```python |
|
|
from diffusers import AutoencoderKLQwenImage |
|
|
|
|
|
vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to("cuda") |
|
|
``` |
|
|
> Use `vae.encode(...)` / `vae.decode(...)` in your pipeline. (A full example is provided below.) |
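If you plan to train a latent diffusion transformer on top of this VAE, the sampled latents are usually standardized first. Below is a minimal sketch of that step, assuming the checkpoint's config exposes per-channel `latents_mean` / `latents_std` entries (as the Qwen-Image VAE config in `diffusers` does); the random batch is just a stand-in for real data:

```python
import torch
from diffusers import AutoencoderKLQwenImage

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to("cuda")

# Stand-in batch in [-1, 1], with the extra `num_frames` dimension: (B, C, T, H, W)
images = torch.rand(1, 3, 1, 256, 256, device="cuda") * 2 - 1

with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample()

# Standardize latents per channel before feeding them to a diffusion transformer
mean = torch.tensor(vae.config.latents_mean, device=latents.device).view(1, -1, 1, 1, 1)
std = torch.tensor(vae.config.latents_std, device=latents.device).view(1, -1, 1, 1, 1)
model_input = (latents - mean) / std
```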
|
|
|
|
|
### 🧩 End-to-End Trained VAE Releases
|
|
|
|
|
| Model | Hugging Face Link |
|-------|-------------------|
| **E2E-FLUX-VAE** | 🤗 [REPA-E/e2e-flux-vae](https://huggingface.co/REPA-E/e2e-flux-vae) |
| **E2E-SD-3.5-VAE** | 🤗 [REPA-E/e2e-sd3.5-vae](https://huggingface.co/REPA-E/e2e-sd3.5-vae) |
| **E2E-Qwen-Image-VAE** | 🤗 [REPA-E/e2e-qwenimage-vae](https://huggingface.co/REPA-E/e2e-qwenimage-vae) |
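
Each variant is expected to load with the same `diffusers` autoencoder class as its base model. A minimal loading sketch, assuming the FLUX and SD-3.5 releases keep the standard `AutoencoderKL` architecture of their base models:

```python
from diffusers import AutoencoderKL, AutoencoderKLQwenImage

flux_vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae")
sd35_vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae")
qwen_vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae")
```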
|
|
|
|
|
## 📦 Requirements
|
|
The following packages are required to load and run the REPA-E VAEs with the `diffusers` library: |
|
|
|
|
|
```bash |
|
|
pip install "diffusers>=0.35.0"
pip install "torch>=2.5.0"
|
|
``` |
|
|
|
|
|
## 📖 Example Usage
|
|
Below is a minimal example showing how to load and use the REPA-E end-to-end trained Qwen-Image-VAE with `diffusers`: |
|
|
|
|
|
```python |
|
|
from io import BytesIO

import numpy as np
import requests
import torch
from diffusers import AutoencoderKLQwenImage
from PIL import Image

device = "cuda"

# Download the example image and normalize it from [0, 255] to [-1, 1]
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
image = Image.open(BytesIO(response.content)).convert("RGB")
image = torch.from_numpy(np.array(image)).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to(device)

# The Qwen-Image VAE expects an additional `num_frames` dimension: (B, C, T, H, W)
image_ = image.unsqueeze(2)

with torch.no_grad():
    latents = vae.encode(image_).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

# Squeeze the extra frame dimension back out
latents = latents.squeeze(2)
reconstructed = reconstructed.squeeze(2)
``` |
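
To inspect the reconstruction, you can map it back to an 8-bit image. A short follow-up to the example above (it reuses `reconstructed` and `Image` from that snippet; the output filename is arbitrary):

```python
# Map from [-1, 1] back to [0, 255], then save as a PNG
recon = (reconstructed[0].permute(1, 2, 0).clamp(-1, 1) + 1) * 127.5
Image.fromarray(recon.to(torch.uint8).cpu().numpy()).save("reconstruction.png")
```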
|
|
|