int4 version of gemma3 27b QAT model

#3
by bugwei

Hi @osanseviero ,

I have a question regarding the int4 version of the gemma-3-27b-it-qat-unquantized model.
I've noticed that Flax versions of the int4 models are available on Kaggle, as shown in the screenshot below.
[Screenshot: Kaggle listing of the Gemma 3 Flax int4 models]

However, when I converted the gemma3-1b-it-int4 (Flax) model to the safetensors format using the script at https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma3/convert_gemma3_weights_orbax_to_hf.py, the resulting weights were not identical to the google/gemma-3-1b-it-qat-int4-unquantized model that Google provides on Hugging Face (I used the 1B model for faster testing). Here's an example of the differences I encountered:

...
not same
Layer: model.layers.0.self_attn.o_proj.weight | Max diff: 0.012939 | Mean diff: 0.003159
dtype=torch.bfloat16, model.layers.0.self_attn.o_proj.weight, shape=torch.Size([1152, 1024])
dtype=torch.bfloat16, model.layers.0.self_attn.o_proj.weight, shape=torch.Size([1152, 1024])

same
dtype=torch.bfloat16, model.layers.0.self_attn.q_norm.weight, shape=torch.Size([256])
dtype=torch.bfloat16, model.layers.0.self_attn.q_norm.weight, shape=torch.Size([256])
...
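
For reference, this is roughly the comparison I ran to produce the output above (a minimal sketch; the local path is a placeholder for my conversion output directory):

```python
# Minimal sketch of the layer-by-layer comparison shown above.
# "./gemma3-1b-it-int4-converted" is a placeholder for the output directory
# produced by convert_gemma3_weights_orbax_to_hf.py on my machine.
import torch
from transformers import AutoModelForCausalLM

converted = AutoModelForCausalLM.from_pretrained(
    "./gemma3-1b-it-int4-converted", torch_dtype=torch.bfloat16
)
official = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it-qat-int4-unquantized", torch_dtype=torch.bfloat16
)

sd_conv = converted.state_dict()
sd_off = official.state_dict()

for name, w_conv in sd_conv.items():
    w_off = sd_off[name]
    if torch.equal(w_conv, w_off):
        print("same")
    else:
        diff = (w_conv.float() - w_off.float()).abs()
        print("not same")
        print(f"Layer: {name} | Max diff: {diff.max().item():.6f} | Mean diff: {diff.mean().item():.6f}")
    print(f"dtype={w_conv.dtype}, {name}, shape={w_conv.shape}")
    print(f"dtype={w_off.dtype}, {name}, shape={w_off.shape}")
```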

I would like to inquire about the conversion process used to create the official int4 quantized models.
Additionally, I would be grateful if you could share any information regarding potential plans to officially release a gemma-3-27b-it-qat-int4-unquantized version in the future.

Google org

Hi @bugwei ,

Thank you for sharing this information. I will raise it with the internal team as a feature request and will update you shortly. Thanks.

Hi @lkv ,

Thank you for addressing this issue. I am looking forward to the release.
I’d like to follow up with a few additional questions.

I've noticed that on Kaggle, there are four variations of the Gemma model in the Flax framework.
Taking the 27B size as an example, these are: gemma3-27b, gemma3-27b-int4, gemma3-27b-it, and gemma3-27b-it-int4.
I would like to ask whether the quantization method used for these models is indeed per-channel quantization, as I assumed (I sketch the scheme I have in mind below).
If so, could you help me understand why my converted model differs from the official one, as seen in the 1B model example I provided earlier?
Furthermore, I've also noticed that the Gemma team performs QAT on the pretrained models.
I'd also like to ask whether models like google/gemma-3-xb-pt-qat-int4-unquantized will be released in the future.
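
For clarity, the per-channel scheme I have been assuming looks roughly like the following (a minimal PyTorch sketch; the symmetric int4 range and per-output-channel scaling are my assumptions, not details confirmed by the release):

```python
import torch

def fake_quant_int4_per_channel(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-channel int4 fake quantization (my assumption, not the
    confirmed Gemma QAT recipe): one scale per output channel, values rounded
    into the signed int4 range [-8, 7] and dequantized back."""
    # w: [out_features, in_features]
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    scale = scale.clamp(min=1e-8)  # avoid division by zero for all-zero rows
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale
```

If the int4-unquantized checkpoints store such quantize-then-dequantize weights in bfloat16, even a small difference in the scale convention (for example, per-row versus per-column scales, or dividing by 7 versus 8) could plausibly produce diffs of the size I observed.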

Thank you for your time and consideration in addressing these questions.
