AssertionError: You do not have CLIP state dict!

#2
by PixelClassisist - opened

I get the following error when trying to use this in Forge. Your TEXT-detail-improved HiT model works fine, though. Any ideas?

Could you specify what you mean by "this" - which model exactly is not working for you? Make sure you use the same variant that worked with the HiT model; e.g., if you used the Text-Encoder-only file (the one with "TE-only" in the filename) for HiT, then also try the TE-only version of the model you're referring to.

Thanks for the reply. I'm referring to "Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors". Currently I'm using "ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF", and that one works fine; however, I often use very long prompts, so I thought the Long version might be better suited. In the Files and versions tab for "Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors" I can't see a TE-only option. Am I missing something, perhaps?

Oh, I am sorry about my confusion!
I just clicked this in my inbox and failed to see we're discussing Long-CLIP, not "normal" CLIP. Sorry about that!

You need to adjust (expand) the embeddings and "inject" the Long-CLIP model for it to work.
https://github.com/SeaArtLab/ComfyUI-Long-CLIP did this for SD and SDXL, while I contributed the Flux node via a pull request.
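
To illustrate what "expand the embeddings" means in practice, here's a minimal sketch (my paraphrase of the general idea, not the ComfyUI-Long-CLIP code itself; the tensor below is a stand-in): stretch the text encoder's [77, 768] position-embedding table to Long-CLIP's 248 positions by interpolation.

```python
import torch
import torch.nn.functional as F

def stretch_position_embeddings(pos_emb: torch.Tensor, new_len: int = 248) -> torch.Tensor:
    """Interpolate a [old_len, dim] position-embedding table to [new_len, dim]."""
    # F.interpolate expects [N, C, L], so reshape to [1, dim, old_len] first.
    stretched = F.interpolate(
        pos_emb.t().unsqueeze(0), size=new_len, mode="linear", align_corners=True
    )
    return stretched.squeeze(0).t()  # back to [new_len, dim]

old_table = torch.randn(77, 768)   # stand-in for the real CLIP-L table
new_table = stretch_position_embeddings(old_table)
print(new_table.shape)             # torch.Size([248, 768])
```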

Unfortunately, I don't use Forge (or do much inference at all; my art has become tweaking the model itself, not so much generating images, haha!). But I hope the ComfyUI details will serve as guidance for what you'd need to implement in Forge - or as a basis for requesting the feature from the Forge authors / community.

Hope that helps / is a starting point, at least!

Any solution for Forge? All the models fail with "ValueError: Failed to recognize model type!"

UPD1: having "CLIP" in the filename helps to load the regular CLIP, but Long-CLIP still throws an error:
RuntimeError: Error(s) in loading state_dict for IntegratedCLIP: size mismatch for transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).

UPD2: Editing this line (the hardcoded 77 from the size mismatch above) to 248 helped!
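
For anyone else debugging this, here's a quick sketch to check which position count a checkpoint actually carries before editing anything (the filename and key name are the ones from this thread; other loaders may prefix the keys differently):

```python
from safetensors.torch import load_file

sd = load_file("Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors")
for key, tensor in sd.items():
    if "position_embedding" in key:
        # Expect (248, 768) for Long-CLIP, (77, 768) for standard CLIP-L.
        print(key, tuple(tensor.shape))
```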

@ceoofcapybaras - glad you figured it out already! In the long term, I guess opening an issue on the repo / asking for a Long-CLIP implementation in Forge would be the best option, so it's available to everybody (and not just to those willing to poke around and edit the code).

Hi, I am getting a similar error while running flux_dev.safetensors on the AUTOMATIC1111 interface; here is a screenshot. Does anyone know how to fix this issue? Also, I have been using a new 50-series NVIDIA GPU.
[Screenshot: Screenshot 2025-07-23 082120.png]

@Benni122 I don't think using Flux with a Stable Diffusion SDXL VAE is right. :)
Try putting a CLIP model there in "VAE / Text Encoder"? Though Flux typically uses two Text Encoders (CLIP and T5), CLIP alone would work, too.
