SigLIP or SigLIP2 encoder?

#48
by orrzohar - opened

SigLIP or SigLIP2 encoder?

Google org

Hi @orrzohar ,

Yes, SigLIP and SigLIP 2 utilize similar encoder architectures, both employing the Vision Transformer (ViT) design with learned positional embeddings.
Could you please refer to this reference.

Thank you.

Yes, but which checkpoint did Gemma 3 use: SigLIP or SigLIP 2?
Thank you!
Orr

@orrzohar : A or B?
@GopiUppari : Yes, ...
🤣

@orrzohar
Yes, SigLIP 2 utilizes a similar encoder architecture to SigLIP. In Gemma 3, they used a 400M-parameter variant of the SigLIP vision encoder.
SigLIP-So400m

@prithivMLmods
Trust me, I am familiar with both SigLIP and SigLIP 2. Both have shape-optimized model variants. I know.
I just want to know WHICH one was used.
Are you from the Google org? There is no Google tag on your reply, I already checked the technical report and all the model configs, and you can't tell from those.
Do you know for certain that they used SigLIP-SO400M and not SigLIP2-SO400M?

Thanks
Orr

No, I'm not from the Google org; I just read your discussion and responded.
I also went through the technical report, but I didn't see anything about it there.
I remember reading an article about Gemma 3, possibly from Gradient Flow (mid-March), that clearly mentioned they used a 400M-parameter variant of the SigLIP vision encoder.
@orrzohar

Edit :

Yeah, this newsletter!
https://gradientflow.com/gemma-3-what-you-need-to-know/?utm_source=chatgpt.com

I’ve had this question since the release. Rather than guessing, let's ask the organization directly once again.

Hi @GopiUppari , can you please tell me exactly which vision encoder (SigLIP or SigLIP2) is used in the Gemma 3 family of models? Is it the SigLIP-SO400M?

Thank you!
Prithiv

Hi @prithivMLmods , sorry for the late response. There are indications and discussions on Hugging Face that it's a 400M-parameter variant. This aligns with models like google/siglip-so400m-patch14-384, where "So400m" refers to "shape-optimized, 400 million parameters". Kindly refer to this link for more information.

Thank you.
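
For anyone who wants to sanity-check the shape-optimized variant themselves, here is a minimal sketch (assuming a recent transformers install and Hugging Face Hub access; the checkpoint id is the one mentioned above) that pulls the config of google/siglip-so400m-patch14-384 so its vision-tower dimensions can be compared with the Gemma 3 vision_config quoted later in this thread:

# Sketch: inspect the vision-tower dimensions of the shape-optimized SigLIP checkpoint.
# Assumes `pip install transformers` and network access to the Hugging Face Hub.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/siglip-so400m-patch14-384")
vision = config.vision_config  # the SiglipVisionConfig inside the composite SiglipConfig

# These fields also appear in Gemma 3's vision_config, so they can be compared directly.
for field in ("hidden_size", "intermediate_size", "num_hidden_layers",
              "num_attention_heads", "patch_size", "image_size"):
    print(field, "=", getattr(vision, field))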

Since Google staff here seem to be absolutely clueless about what's going on, I'll provide a response with some clues: Gemma 3 is potentially using SigLIP v1.
Proof:
Line 816 of the Gemma 3 modeling code instantiates the vision tower via AutoModel, based on the Gemma3Config:
https://github.com/huggingface/transformers/blob/9bec2654ed5b4ac43e880dc7e3cb2c18aeae70a9/src/transformers/models/gemma3/modeling_gemma3.py#L816

For Gemma 3 27B IT, the configuration is the following:

"vision_config": {
    "hidden_size": 1152,
    "image_size": 896,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "vision_use_head": false
  }

You can clearly see the model type is siglip_vision_model, which is what SigLIP v1 uses.
Line 164 of SigLIP v1's SiglipVisionConfig defines the model type as siglip_vision_model:
https://github.com/huggingface/transformers/blob/9bec2654ed5b4ac43e880dc7e3cb2c18aeae70a9/src/transformers/models/siglip/configuration_siglip.py#L164C19-L164C38
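
To see this dispatch for yourself, here is a minimal sketch (assuming a recent transformers release; the dict below just reproduces the vision_config quoted above) that checks which classes the siglip_vision_model type resolves to:

# Sketch: reproduce the AutoModel dispatch that the Gemma 3 modeling code performs.
# Assumes a recent `transformers`; the dict mirrors the Gemma 3 27B IT vision_config above.
from transformers import AutoModel
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

vision_cfg_dict = {
    "hidden_size": 1152,
    "image_size": 896,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "vision_use_head": False,
}

# model_type -> config class, the same mapping AutoConfig/AutoModel rely on.
cfg_cls = CONFIG_MAPPING[vision_cfg_dict["model_type"]]
print(cfg_cls.__name__)  # expected: SiglipVisionConfig (the SigLIP v1 config class)

# Build the (randomly initialized) vision tower the same way the Gemma 3 code does,
# just to see which model class gets picked.
vision_tower = AutoModel.from_config(cfg_cls(**vision_cfg_dict))
print(type(vision_tower).__name__)  # expected: SiglipVisionModel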

Meanwhile, for v2 the model type is siglip2_vision_model in Siglip2VisionConfig:
Line 172:
https://github.com/huggingface/transformers/blob/9bec2654ed5b4ac43e880dc7e3cb2c18aeae70a9/src/transformers/models/siglip2/configuration_siglip2.py#L172C19-L172C39

Now here is where it becomes tricky:
SigLIP 2 does not seem to be referenced anywhere, because the SigLIP 2 checkpoints themselves also declare the SigLIP architecture for some reason:
https://huggingface.co/google/siglip2-base-patch16-256/blob/main/config.json
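
You can check what that checkpoint declares without opening the raw JSON; a minimal sketch (assuming the fixed-resolution google/siglip2-base-patch16-256 repo linked above and Hub access):

# Sketch: inspect what a fixed-resolution SigLIP 2 checkpoint declares in its config.
# Assumes `transformers` is installed and the Hugging Face Hub is reachable.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/siglip2-base-patch16-256")
print(cfg.model_type)     # per the linked config.json: "siglip", i.e. the v1 model type
print(cfg.architectures)  # per the linked config.json: ["SiglipModel"]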

Assuming SigLIP 2 was built by the same people who did SigLIP 1 (same names on both papers), we can assume its development took ~1.1 years. Meanwhile, assuming Gemma 3's development took ~6 months (considering Gemma 2's release), broken down roughly into:

  1. Dataset creation (2 months)
  2. Hyperparameter searches (1 month)
  3. Actual pretraining/post-training/RLHF (2 months, considering the Christmas holidays, potential problems, etc.)
  4. Approvals/integrations (1 month)

Unless SigLIP 2 was already fully finished by October 2024, with its weights frozen, it's unlikely to have been used for Gemma 3 pretraining.

I hope this helps :)


@lkv No issues, thanks for your reply. btw, I got clarity after a few days of discussion.


@shadowlilac Thanks for your thoughts and explanation.

So, the thing is, SigLIP 2 doesn't always use its own 'siglip2_vision_model' type, because its architecture is intentionally backward compatible with SigLIP 1. Using the same model_type allows direct weight and tokenizer swaps, so existing pipelines keep working without changes. The new siglip2 model type exists in the codebase for extensibility, but the default remains 'siglip_vision_model' to maximize compatibility.
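
To make that compatibility concrete, here is a minimal sketch (assuming the fixed-resolution google/siglip2-base-patch16-256 checkpoint discussed above, which declares the v1 model type in its config) showing the SigLIP 2 weights loading straight through the original SigLIP classes:

# Sketch: a fixed-resolution SigLIP 2 checkpoint loads through the SigLIP v1 classes,
# because its config declares the v1 model type. Assumes `transformers` and Hub access.
from transformers import AutoModel, SiglipVisionModel

# AutoModel dispatches on model_type, so this should resolve to the v1 SiglipModel class.
model = AutoModel.from_pretrained("google/siglip2-base-patch16-256")
print(type(model).__name__)  # expected: SiglipModel

# The vision tower alone can likewise be pulled out with the v1 vision class.
vision = SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")
print(type(vision).__name__)  # SiglipVisionModel, carrying the SigLIP 2 vision weights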
