Possibility of replacing base pretrained models for inference

#2
by jing-yi - opened

MS-Diffusion's trainable adapters are built on SDXL and CLIP-G: they transform CLIP image features into SDXL cross-attention tokens. A distilled SDXL can be used as long as it keeps the same cross-attention layers. However, CLIP-G cannot be replaced by CLIP-L, because the image features the two encoders output differ in shape.
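To illustrate the shape constraint, here is a minimal sketch, not MS-Diffusion's actual code. `EncoderSpec`, `can_feed_adapter`, and `ADAPTER_IN_DIM` are hypothetical names, and the widths used (1280 for CLIP-G, 768 for CLIP-L) are the commonly cited projected image-embedding sizes, assumed here for illustration. The thread does not say which feature tensor the adapter consumes, so the real numbers may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical description of an image encoder's output."""
    name: str
    feature_dim: int  # width of the image features handed to the adapter


# Assumed widths, for illustration only (not taken from the thread).
CLIP_G = EncoderSpec("CLIP-G", 1280)
CLIP_L = EncoderSpec("CLIP-L", 768)

# The adapter's input projection is trained against CLIP-G features,
# so its expected input width is fixed to that encoder's feature width.
ADAPTER_IN_DIM = CLIP_G.feature_dim


def can_feed_adapter(encoder: EncoderSpec, adapter_in_dim: int) -> bool:
    """Return True only if the encoder's feature width matches what the
    trained adapter weights expect."""
    return encoder.feature_dim == adapter_in_dim


for enc in (CLIP_G, CLIP_L):
    status = "compatible" if can_feed_adapter(enc, ADAPTER_IN_DIM) else "shape mismatch"
    print(f"{enc.name}: {status}")
```

The same reasoning is why a distilled SDXL can work: if its cross-attention layers keep the same dimensions, the tokens the adapter produces still fit.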
