Gemma3-12B-IT breaks due to attention error in 4bit
I get this error:
packages/transformers/integrations/sdpa_attention.py", line 54, in sdpa_attention_forward
attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: p.attn_bias_ptr is not correctly aligned
Gemma3 seems poorly supported in 4-bit, and this sucks a lot.
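For context, here is a minimal sketch of the kind of 4-bit load that goes through that SDPA path; the quantization settings and prompt below are placeholders, not necessarily the exact setup that triggers the error:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"

# Plain bitsandbytes 4-bit quantization. SDPA is the default attention backend,
# but it is spelled out here because that is where the alignment error is raised.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="sdpa",
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Text-only chat prompt; the wording is a placeholder.
messages = [{"role": "user", "content": [{"type": "text", "text": "Explain attention in one sentence."}]}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```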
Same here. Have you found a solution?
@Redasus I am going to upload my own quantised version of Gemma3. As for the error itself, I realised flash-attention didn't give me any problems in bfloat16.
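For anyone who wants to try the bfloat16 + flash-attention route, a minimal sketch (it assumes the flash-attn package is installed and the GPU supports it):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"

# Full bfloat16 weights (no bitsandbytes) with FlashAttention-2 instead of the SDPA backend.
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained(model_id)
# Prompting works exactly as in the 4-bit snippet above; only the load call changes.
```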
Hi @sleeping4cat,
I have done the 4-bit quantization of the google/gemma-3-12b-it model; it works perfectly fine for me and produces responses for the given prompts. Could you please refer to the following gist file? Please let me know if you require any further assistance.
Thanks.
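A rough sketch of what a working 4-bit load can look like through the pipeline API; the NF4 and double-quantization settings below are illustrative and are not necessarily what the gist uses:

```python
import torch
from transformers import BitsAndBytesConfig, pipeline

# Illustrative 4-bit config: NF4 with double quantization and bfloat16 compute dtype.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",
    model_kwargs={"quantization_config": bnb_config},
)

messages = [{"role": "user", "content": [{"type": "text", "text": "Write a haiku about quantization."}]}]
output = pipe(text=messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])
```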
@BalakrishnaCh thanks, but I think it's something related to bitsandbytes. I have quantised the model to a 2-bit GGUF version and uploaded it. The problem I encountered only occurs for some specific prompts/inputs, which is weird. https://huggingface.co/sleeping-ai/Gemma3-12B-IT-TQ2-0
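A sketch of loading that GGUF repo with llama-cpp-python, assuming the underlying llama.cpp build is recent enough to support the Gemma 3 architecture and the TQ2_0 quant type; the filename pattern is a guess and may need adjusting to the actual file in the repo:

```python
from llama_cpp import Llama

# Downloads a matching .gguf file from the Hub repo and loads it.
llm = Llama.from_pretrained(
    repo_id="sleeping-ai/Gemma3-12B-IT-TQ2-0",
    filename="*.gguf",  # assumed pattern; replace with the exact filename if needed
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```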
@sleeping4cat If the issue is resolved, please feel free to close it. If not, please let us know whether you are still facing it during the quantization process or when prompting, and share additional details so we can assist you further.
Thanks.