SigLIP or SigLIP2 encoder?
Yes, but which SigLIP checkpoint did Gemma 3 use? SigLIP 2 or SigLIP?
Thank you!
Orr
@orrzohar
Yes, SigLIP 2 utilizes a similar encoder architecture to SigLIP. In Gemma 3, they used a 400M-parameter variant of the SigLIP vision encoder.
SigLIP-So400m
@prithivMLmods
Trust me, I am familiar with SigLIP and SigLIP 2. Both have shape-optimized model variants. I know.
I just want to know WHICH was used.
Are you from the Google org? There is no Google tag on your profile, and I already checked the technical report and all the model configs; you can't tell from those.
Do you know for a fact that they used SigLIP-SO400M and not SigLIP2-SO400M?
Thanks
Orr
No, I'm not from the Google org. I just read your discussion, so I responded.
I also analyzed the technical report, but I didn’t see anything about it.
I remember reading an article about Gemma 3, possibly from Gradient Flow (mid-March). It clearly mentioned that they used a 400M-parameter variant of the SigLIP vision encoder.
@orrzohar
Edit: Yeah, this newsletter!
https://gradientflow.com/gemma-3-what-you-need-to-know/?utm_source=chatgpt.com
I’ve had this question since the release. Rather than guessing, let's ask the organization directly once again.
Hi @GopiUppari , can you please tell me exactly which vision encoder (SigLIP or SigLIP2) is used in the Gemma 3 family of models? Is it the SigLIP-SO400M?
Thank you!
Prithiv
Hi @prithivMLmods , sorry for the late response. There are indications and discussions on Hugging Face that it's a 400M-parameter variant. This aligns with models like google/siglip-so400m-patch14-384, where "So400m" refers to "Shape-optimized, 400 million parameters." Kindly refer to this link for more information.
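If you want a quick sanity check of the "So400m" naming, here's a minimal sketch (assuming transformers and torch are installed; the count won't be a round 400M):

```python
from transformers import SiglipVisionModel

# Load just the vision tower of the shape-optimized checkpoint and count
# its parameters as a rough sanity check of the "400M" in the name.
model = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters in the vision tower")  # roughly 400M
```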
Thank you.
Since the Google staff here seem to be absolutely clueless about what's going on, I'll provide a response with some clues. Gemma 3 is potentially using SigLip V1.
Proof:
Line 816 in Gemma 3's modeling code instantiates the vision tower via AutoModel, based on the Gemma3Config:
https://github.com/huggingface/transformers/blob/9bec2654ed5b4ac43e880dc7e3cb2c18aeae70a9/src/transformers/models/gemma3/modeling_gemma3.py#L816
For Gemma 3 27B IT, the configuration is the following:
"vision_config": {
"hidden_size": 1152,
"image_size": 896,
"intermediate_size": 4304,
"model_type": "siglip_vision_model",
"num_attention_heads": 16,
"num_hidden_layers": 27,
"patch_size": 14,
"vision_use_head": false
}
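If you'd rather check this programmatically than read the config by hand, here's a minimal sketch (assuming transformers is installed and you have access to the gated google/gemma-3-27b-it repo):

```python
from transformers import AutoConfig

# Pull the Gemma 3 27B IT config from the Hub (a gated repo, so you may
# need to authenticate with `huggingface-cli login` first).
config = AutoConfig.from_pretrained("google/gemma-3-27b-it")

# This is the sub-config the vision tower is instantiated from; its
# model_type decides which architecture AutoModel resolves to.
print(config.vision_config.model_type)  # -> "siglip_vision_model"
```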
You can clearly see the model type is "siglip_vision_model", which is what is used for SigLip V1. Line 164 in SigLip V1's SiglipVisionConfig defines it as "siglip_vision_model":
https://github.com/huggingface/transformers/blob/9bec2654ed5b4ac43e880dc7e3cb2c18aeae70a9/src/transformers/models/siglip/configuration_siglip.py#L164C19-L164C38
Meanwhile, for v2 the model type is "siglip2_vision_model" in Siglip2VisionConfig (line 172):
https://github.com/huggingface/transformers/blob/9bec2654ed5b4ac43e880dc7e3cb2c18aeae70a9/src/transformers/models/siglip2/configuration_siglip2.py#L172C19-L172C39
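You don't even need to read the source; the two config classes expose these strings directly (a minimal sketch, assuming a transformers version recent enough to ship the siglip2 classes):

```python
from transformers import Siglip2VisionConfig, SiglipVisionConfig

# Each config class hardcodes the model_type string that AutoModel keys
# on when deciding which architecture to instantiate.
print(SiglipVisionConfig().model_type)   # -> "siglip_vision_model"
print(Siglip2VisionConfig().model_type)  # -> "siglip2_vision_model"
```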
Now here is where it becomes tricky:
SigLip 2 does not seem to be used here, because for some reason the SigLip 2 checkpoints also declare the SigLip architecture in their configs:
https://huggingface.co/google/siglip2-base-patch16-256/blob/main/config.json
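You can confirm this by pulling the raw config.json yourself (a minimal sketch, assuming huggingface_hub is installed and the config on the Hub still matches what it said at the time of this discussion):

```python
import json

from huggingface_hub import hf_hub_download

# Download the raw config.json of a SigLIP 2 checkpoint and check which
# architecture it actually declares.
path = hf_hub_download("google/siglip2-base-patch16-256", "config.json")
with open(path) as f:
    cfg = json.load(f)

print(cfg["model_type"])     # -> "siglip", not "siglip2"
print(cfg["architectures"])  # -> ["SiglipModel"]
```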
Assuming SigLip 2 was built by the same people who built SigLip 1 (the same names appear on both papers), we can estimate its development took ~1.1 years. Meanwhile, assume Gemma 3's development took ~6 months (counting from Gemma 2's release), broken down roughly into:
- Dataset creation (2 months)
- Hyperparameter searches (1 month)
- Actual pretraining/post-training/RLHF (2 months, considering Christmas holidays, potential problems, etc.)
- Approvals/integrations (1 month)
Unless SigLip 2 was already fully finished by October 2024, with its weights frozen, it's unlikely it would have been used for Gemma 3 pretraining.
I hope this helps :)
@lkv No issues, thanks for your reply. By the way, I got clarity after a few days of discussion.
@shadowlilac Thanks for your thoughts and explanation.
So, the thing is: SigLIP 2 doesn't always use its own 'siglip2_vision_model' type, because its architecture is intentionally backward compatible with SigLIP 1. Using the same model_type allows direct weight and tokenizer swaps, ensuring seamless upgrades and integration into existing pipelines. The new model type exists in the codebase for extensibility, but the default remains 'siglip_vision_model' to maximize compatibility.
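That backward compatibility is easy to see in practice (a minimal sketch, assuming transformers is installed):

```python
from transformers import AutoModel

# Because the checkpoint's config declares model_type "siglip", AutoModel
# resolves it to the SigLIP 1 classes, even though the weights are SigLIP 2.
model = AutoModel.from_pretrained("google/siglip2-base-patch16-256")
print(type(model).__name__)  # -> "SiglipModel"
```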