About finetuning current SDXL weights with the EQ-SDXL-VAE
You said in the intro: "You can try to use this VAE to finetune your sdxl model and expect a better final result, but it may require lot of time to achieve it...". I am still very interested in utilizing the existing model weights. So my question is: how much is "a lot"? I have ~500k samples; how many iterations are required to align the UNet of SDXL with the new latent space?
A lot of training time.
ALTHOUGH some reported results say that "a few k steps with a small LoRA works well".
Your setup is definitely OK.
I just thought a dataset like LAION-400M was needed. In the end it turns out that something on the scale of a few thousand samples is said to work.
My thought was something like Danbooru (8M) or CC12M.
And yes, I'm also surprised that a few k, or just a few dozen k, is enough.
I spent a night on a quick try, finetuning a LoRA for about 48k iterations, and got a very poor result, so I suspect something is wrong in my finetuning process. Do I need to modify my training script with respect to the VAE? I ask because I notice there are some parameters in the config that are not used by the original VAE, such as:

```json
"shift_factor": 0.8640247167934477,
```
In my training script, the VAE encoding part goes like this:

```python
model_input = vae.encode(pixel_values).latent_dist.sample()
model_input = model_input * vae.config.scaling_factor
model_input = model_input.to(weight_dtype)
```
Should I change the code to:

```python
model_input = model_input * vae.config.scaling_factor + vae.config.shift_factor
```

?
By the way, I use StableDiffusionXLPipeline from diffusers for inference.
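Roughly like this (a minimal sketch; the paths are placeholders for my actual EQ-SDXL-VAE checkpoint and base model):

```python
import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# Placeholder paths; swap in the actual EQ-SDXL-VAE checkpoint and SDXL base.
vae = AutoencoderKL.from_pretrained("path/to/EQ-SDXL-VAE", torch_dtype=torch.float16)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")
```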
I would strongly recommend the following:

Encode:
```python
latent = vae.encode(pixel).latent_dist.sample()
# Per-channel stats from the VAE config, moved to the latent's device/dtype.
latents_mean = torch.tensor(vae.config.latents_mean)[None, :, None, None].to(latent.device, latent.dtype)
latents_std = torch.tensor(vae.config.latents_std)[None, :, None, None].to(latent.device, latent.dtype)
std_latent = (latent - latents_mean) / latents_std
model_input = std_latent.to(weight_dtype)
```
Decode:
```python
# Undo the channel-wise normalization, then decode; the decoder output is in
# [-1, 1], so map it back to [0, 1].
latent = model_output * latents_std + latents_mean
pixel = vae.decode(latent).sample * 0.5 + 0.5
```
To utilize this in the SDXL pipeline you may need to modify the pipeline source code; if you don't want to do that, just finetune with "scale" + "shift" only, which follows the pipeline implementation.
Your "should I change the code into..." snippet is correct if you only want to modify the trainer code.
I checked the SDXL pipeline of diffusers v0.32.2 https://github.com/huggingface/diffusers/blob/560fb5f4d65b8593c13e4be50a59b1fd9c2d9992/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1268-L1281
and found out that this version of the pipeline has taken the mean and std into consideration:
```python
# unscale/denormalize the latents
# denormalize with the mean and std if available and not None
has_latents_mean = hasattr(self.vae.config, "latents_mean") and self.vae.config.latents_mean is not None
has_latents_std = hasattr(self.vae.config, "latents_std") and self.vae.config.latents_std is not None
if has_latents_mean and has_latents_std:
    latents_mean = (
        torch.tensor(self.vae.config.latents_mean).view(1, 4, 1, 1).to(latents.device, latents.dtype)
    )
    latents_std = (
        torch.tensor(self.vae.config.latents_std).view(1, 4, 1, 1).to(latents.device, latents.dtype)
    )
    latents = latents * latents_std / self.vae.config.scaling_factor + latents_mean
else:
    latents = latents / self.vae.config.scaling_factor
```
But the calculation still differs noticeably from the code you provided. For example, the pipeline code still unscales the latents by `scaling_factor` regardless of whether the mean and std are used. So I trained with `model_input = model_input * vae.config.scaling_factor` but inferenced with `latents = latents * latents_std / self.vae.config.scaling_factor + latents_mean`; no wonder it finally led to a poor result.
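So to stay consistent with that pipeline decode, I believe the trainer-side encode has to be its exact inverse. A sketch of what I should have trained with (using the names from my script above):

```python
import torch

latent = vae.encode(pixel_values).latent_dist.sample()
latents_mean = torch.tensor(vae.config.latents_mean).view(1, 4, 1, 1).to(latent.device, latent.dtype)
latents_std = torch.tensor(vae.config.latents_std).view(1, 4, 1, 1).to(latent.device, latent.dtype)
# Exact inverse of the pipeline's
#   latents = latents * latents_std / scaling_factor + latents_mean
model_input = (latent - latents_mean) * vae.config.scaling_factor / latents_std
model_input = model_input.to(weight_dtype)
```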
According to my understanding, the difference between `model_input = model_input * vae.config.scaling_factor` and `latents = latents * latents_std / self.vae.config.scaling_factor + latents_mean` is that the former applies the same scale and shift to every channel of the latent, while the latter applies a channel-wise mean and std, which is possibly better for normalization. Is that the purpose, as I thought?
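To illustrate what I mean by global vs. channel-wise (a toy sketch with made-up statistics, just to show the shapes):

```python
import torch

latent = torch.randn(1, 4, 128, 128)  # dummy SDXL-shaped latent

# Global scaling: one scalar applied identically to all 4 channels.
global_norm = latent * 0.13025  # SDXL's original scaling_factor

# Channel-wise: each channel gets its own mean/std (made-up numbers here),
# so channels with different statistics all end up roughly standardized.
latents_mean = torch.tensor([0.1, -0.2, 0.3, 0.0]).view(1, 4, 1, 1)
latents_std = torch.tensor([1.5, 0.9, 1.1, 1.3]).view(1, 4, 1, 1)
channel_norm = (latent - latents_mean) / latents_std
```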