Models?
Does this only work for flux?
You can use it for anything that uses a "CLIP-L" Text Encoder. Stable Diffusion (any up to SD3), HunyuanVideo, ... :)
Can confirm it works in newer models as well. ComfyUI supports Long-CLIP natively now, so it's just a quick dropdown change for Flux/Chroma-type models.
They literally just took CLIP ViT-L/14 and "upscaled" the Vision Transformer (interpolated the positional embeddings; same patch size -> more patch tokens, i.e. a longer sequence), left the ViT-L/14 Text Encoder as-is, and fine-tuned so the model could adjust to the altered Vision Transformer -> that's CLIP ViT-L/14@336. So architecturally, it is literally a ViT-L/14 / CLIP-L Text Encoder. No difference.
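For the curious, here's a rough sketch of what that positional-embedding interpolation looks like (names and shapes are my own illustration, not the actual OpenAI training code): ViT-L/14 at 224px has 16x16 = 256 patch positions plus a class token; at 336px it's 24x24 = 576 plus the class token. You just resize the 16x16 grid of learned embeddings to 24x24:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=16, new_grid=24):
    # pos_embed: (1 + old_grid**2, dim) -- class token first, then patch positions
    cls_tok, patch_pos = pos_embed[:1], pos_embed[1:]
    dim = patch_pos.shape[1]
    # lay the patch positions out as a 2-D grid and resize it with bicubic interpolation
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=0)

# ViT-L/14: 224/14 = 16x16 patches -> 336/14 = 24x24 patches, embedding width 1024
old = torch.randn(1 + 16 * 16, 1024)   # 257 positions
new = interpolate_pos_embed(old)
print(new.shape)  # torch.Size([577, 1024])
```

After this resize, the vision tower sees 577 tokens instead of 257, and the fine-tune lets it settle into the new resolution. The text encoder never changes, which is why the @336 checkpoint drops in anywhere a CLIP-L text encoder is expected.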