Gemma3n not working on H20 with bfloat16 data type.

#30
by NOWSHAD - opened

I tried running the Gemma3n model on an H20 card with the bfloat16 data type and it throws a floating point exception and fails. When I try the float32 data type instead, it works.
The example I'm trying is the same one present in the sample notebook.
On the other hand, if I use the example from the model card, bfloat16 fails with a floating point exception (same as earlier), but float32 fails with the error below:

```
Unsupported: call_method GetAttrVariable(UserDefinedObjectVariable(AttentionInterface), _global_mapping) __getitem__ (ConstantVariable(),) {}
```

I tried the same on an H100 and it works as expected.

Anyone else faced this issue?
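
For reference, here is a minimal sketch of the failing setup (the model ID, task, and prompt are assumptions based on the Gemma3n model card, not the exact notebook code):

```python
import torch
from transformers import pipeline

# Minimal sketch of the setup under test. Model ID and task are assumed
# from the Gemma3n model card; adjust to the exact variant and example used.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-e4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16,   # floating point exception on H20
    # torch_dtype=torch.float32,  # works on H20 for the notebook example
)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Why is the sky blue?"}]},
]
print(pipe(text=messages, max_new_tokens=32))
```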

Hi @NOWSHAD ,

Welcome to the Google Gemma family of open source models. The issues above might be caused by the compatibility of floating point operations with a particular piece of hardware. Please try the following suggestions to avoid this kind of issue:

Update Everything: Ensure your NVIDIA drivers, CUDA toolkit, cuDNN, PyTorch, transformers, and bitsandbytes libraries are all at their latest stable versions. This is the most common fix.
Check Compatibility Matrix: Verify that your specific driver and CUDA versions are officially recommended for PyTorch on Hopper GPUs (a quick version-check sketch follows this list).
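
For example, a quick way to print the relevant versions (generic, nothing Gemma3n-specific):

```python
import torch
import transformers

# Environment report to compare against the recommended versions.
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("transformers:", transformers.__version__)
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
```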

Also double-check how the model is loaded (the from_pretrained arguments) and how model.generate() is called (e.g., do_sample=False vs. True, and the specific generation parameters). A sketch with everything pinned down follows.
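
The second error (the Unsupported: call_method ... AttentionInterface one) looks like a torch.compile/Dynamo graph break inside the transformers attention dispatch, so pinning the attention implementation explicitly may also help. Here is a minimal sketch of loading and generating with the arguments made explicit (the model ID, auto class, and eager-attention workaround are assumptions, not a confirmed fix):

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-3n-e4b-it"  # assumed Gemma3n variant

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the dtype under test; swap for torch.float32
    device_map="auto",
    attn_implementation="eager",  # avoids SDPA/flash-attention kernel paths
)

inputs = processor(text="Why is the sky blue?", return_tensors="pt").to(model.device)
# Deterministic decoding to rule out sampling-related differences.
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.decode(out[0], skip_special_tokens=True))
```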

Thanks.


For the core dump on H20, check this: https://github.com/vllm-project/vllm/issues/4392#issuecomment-2227935528

Thank you @BalakrishnaCh @CHNtentes for your inputs. Yes, playing around with the generation config and updating the packages resolved the issue.

NOWSHAD changed discussion status to closed
