Models?
Does this only work for flux?
You can use it for anything that uses a "CLIP-L" Text Encoder. Stable Diffusion (any up to SD3), HunyuanVideo, ... :)
Can confirm it works in newer models as well. ComfyUI supports Long-CLIP natively now, so it's just a quick dropdown change for Flux/Chroma-type models.
They literally just took CLIP ViT-L/14 and "upscaled" the Vision Transformer (interpolated the positional embeddings; same patch size -> more patch tokens, i.e. a longer sequence), left the ViT-L/14 Text Encoder as-is, and fine-tuned so the model could adjust to the altered Vision Transformer -> that's CLIP ViT-L/14@336. So architecturally, it is literally a ViT-L/14 / CLIP-L Text Encoder. No difference.
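For the curious, here's a rough sketch of what that positional-embedding interpolation looks like (names and shapes are my own illustration, not the actual OpenAI training code): ViT-L/14 at 224px has 16x16 = 256 patch positions plus a class token; at 336px it's 24x24 = 576 plus the class token. You just resize the 16x16 grid of learned embeddings to 24x24:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=16, new_grid=24):
    # pos_embed: (1 + old_grid**2, dim) -- class token first, then patch positions
    cls_tok, patch_pos = pos_embed[:1], pos_embed[1:]
    dim = patch_pos.shape[1]
    # lay the patch positions out as a 2-D grid and resize it with bicubic interpolation
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=0)

# ViT-L/14: 224/14 = 16x16 patches -> 336/14 = 24x24 patches, embedding width 1024
old = torch.randn(1 + 16 * 16, 1024)   # 257 positions
new = interpolate_pos_embed(old)
print(new.shape)  # torch.Size([577, 1024])
```

After this resize, the vision tower sees 577 tokens instead of 257, and the fine-tune lets it settle into the new resolution. The text encoder never changes, which is why the @336 checkpoint drops in anywhere a CLIP-L text encoder is expected.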