int4 version of gemma3 27b QAT model

#3
by bugwei

Hi @osanseviero ,

I have a question regarding the int4 version of the gemma-3-27b-it-qat-unquantized model.
I've noticed that Flax versions of the int4 models are available on Kaggle, as shown in the screenshot below.
[Screenshot: Kaggle listing of the Gemma 3 Flax int4 models]

However, when I converted the gemma3-1b-it-int4 (Flax) model to the safetensors format using the script at https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma3/convert_gemma3_weights_orbax_to_hf.py, the resulting weights were not identical to the google/gemma-3-1b-it-qat-int4-unquantized model that Google provides on Hugging Face (I used the 1B model for faster testing). Here's an example of the differences I encountered:

...
not same
Layer: model.layers.0.self_attn.o_proj.weight | Max diff: 0.012939 | Mean diff: 0.003159
dtype=torch.bfloat16, model.layers.0.self_attn.o_proj.weight, shape=torch.Size([1152, 1024])
dtype=torch.bfloat16, model.layers.0.self_attn.o_proj.weight, shape=torch.Size([1152, 1024])

same
dtype=torch.bfloat16, model.layers.0.self_attn.q_norm.weight, shape=torch.Size([256])
dtype=torch.bfloat16, model.layers.0.self_attn.q_norm.weight, shape=torch.Size([256])
...
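
For reference, this is roughly the comparison I ran to produce the output above (a minimal sketch; the local path is a placeholder for my conversion output directory):

```python
# Minimal sketch of the layer-by-layer comparison shown above.
# "./gemma3-1b-it-int4-converted" is a placeholder for the output directory
# produced by convert_gemma3_weights_orbax_to_hf.py on my machine.
import torch
from transformers import AutoModelForCausalLM

converted = AutoModelForCausalLM.from_pretrained(
    "./gemma3-1b-it-int4-converted", torch_dtype=torch.bfloat16
)
official = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it-qat-int4-unquantized", torch_dtype=torch.bfloat16
)

sd_conv = converted.state_dict()
sd_off = official.state_dict()

for name, w_conv in sd_conv.items():
    w_off = sd_off[name]
    if torch.equal(w_conv, w_off):
        print("same")
    else:
        diff = (w_conv.float() - w_off.float()).abs()
        print("not same")
        print(f"Layer: {name} | Max diff: {diff.max().item():.6f} | Mean diff: {diff.mean().item():.6f}")
    print(f"dtype={w_conv.dtype}, {name}, shape={w_conv.shape}")
    print(f"dtype={w_off.dtype}, {name}, shape={w_off.shape}")
```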

I would like to inquire about the conversion process used to create the official int4 quantized models.
Additionally, I would be grateful if you could share any information regarding potential plans to officially release a gemma-3-27b-it-qat-int4-unquantized version in the future.

Google org

Hi @bugwei ,

Thank you for sharing this information. I will raise it with the internal team as a feature request and will update you shortly. Thanks.

Hi @lkv ,

Thank you for addressing this issue. I am looking forward to the release.
I’d like to follow up with a few additional questions.

I've noticed that on Kaggle, there are four variations of the Gemma model in the Flax framework.
Taking the 27B size as an example, these are: gemma3-27b, gemma3-27b-int4, gemma3-27b-it, and gemma3-27b-it-int4.
I would like to ask whether the quantization method used for these models is indeed per-channel quantization, as I assumed (I sketch the scheme I have in mind below).
If so, could you help me understand why my converted model differs from the official one, as seen in the 1B model example I provided earlier?
Furthermore, I've also noticed that the Gemma team performs QAT on the pretrained models.
I'd also like to ask whether models like google/gemma-3-xb-pt-qat-int4-unquantized will be released in the future.
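
For clarity, the per-channel scheme I have been assuming looks roughly like the following (a minimal PyTorch sketch; the symmetric int4 range and per-output-channel scaling are my assumptions, not details confirmed by the release):

```python
import torch

def fake_quant_int4_per_channel(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-channel int4 fake quantization (my assumption, not the
    confirmed Gemma QAT recipe): one scale per output channel, values rounded
    into the signed int4 range [-8, 7] and dequantized back."""
    # w: [out_features, in_features]
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    scale = scale.clamp(min=1e-8)  # avoid division by zero for all-zero rows
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale
```

If the int4-unquantized checkpoints store such quantize-then-dequantize weights in bfloat16, even a small difference in the scale convention (for example, per-row versus per-column scales, or dividing by 7 versus 8) could plausibly produce diffs of the size I observed.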

Thank you for your time and consideration in addressing these questions.
