RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```python
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="GPU.0", ov_config=ov_config)  # For GPU use "GPU.0"
```

The script you provided runs normally on the CPU, but when I change the device to GPU, the following error appears:
```
/root/openvino-gemma3-env/lib/python3.12/site-packages/openvino/runtime/__init__.py:10: DeprecationWarning: The `openvino.runtime` module is deprecated and will be removed in the 2026.0 release. Please replace `openvino.runtime` with `openvino`.
  warnings.warn(
Loading model... this should get faster after the first generation due to caching behavior.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Sum of image and text tokens: 272
/root/openvino-gemma3-env/lib/python3.12/site-packages/transformers/generation/utils.py:1947: UserWarning: This model does not support `Cache` instances, it only supports the legacy cache format (tuple of tuples). `cache_implementation` (set to `hybrid`) will be ignored.
  warnings.warn(
Traceback (most recent call last):
  File "/root/openvino-gemma3-env/gemma-3-4b-it-int4-gpu.py", line 43, in <module>
    output_ids = model.generate(**inputs, max_new_tokens=1024)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/openvino-gemma3-env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/openvino-gemma3-env/lib/python3.12/site-packages/transformers/generation/utils.py", line 2465, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/openvino-gemma3-env/lib/python3.12/site-packages/transformers/generation/utils.py", line 3476, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/openvino-gemma3-env/lib/python3.12/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 85, in wrapped
    op1 = operator(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
I actually just figured this out last night: `do_sample` must be set to `False`, though the example doesn't set it. Try setting it explicitly:

```python
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
```

Let me know if that helps, and thanks for the issue!
Gemma-3 exhibits numerical instability when you run inference at FP16; it requires at least BF16. It works on the CPU because the CPU plugin defaults to FP32:
```python
import openvino.runtime as ov

core = ov.Core()
print(core.get_property("CPU", "INFERENCE_PRECISION_HINT"))
# <Type: 'float32'>
print(core.get_property("GPU", "INFERENCE_PRECISION_HINT"))
# <Type: 'float16'>
```
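For intuition on why the f16 default breaks: IEEE half precision tops out at ±65504, so any activation beyond that overflows to `inf`, which then turns into `nan` downstream (e.g. in a softmax). A quick stdlib-only illustration of the clipping point (the `to_fp16` helper is just a demo, not part of OpenVINO):

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE-754 half-precision value

def to_fp16(x: float) -> float:
    """Round-trip a Python float through half precision (demo helper)."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # Real fp16 hardware saturates to infinity rather than raising
        return float("inf") if x > 0 else float("-inf")

print(to_fp16(60000.0))  # 60000.0 -- still representable
print(to_fp16(70000.0))  # inf -- overflows, later becomes nan in softmax
```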
See here for another example: https://github.com/jukofyork/control-vectors/pull/4

The A770 supports BF16, so if you can expose `INFERENCE_PRECISION_HINT` via `/load_model`, that would solve it properly.
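For anyone loading through optimum-intel directly rather than a server, here's a sketch of forcing the hint at load time. The key names are OpenVINO runtime properties and the `fp32` value mirrors the config reported to work in this thread; treat it as an assumption, not a verified fix:

```python
# Hypothetical load-time override: ask the GPU plugin for FP32 math
# instead of its f16 default. Keys are OpenVINO runtime properties.
ov_config = {
    "INFERENCE_PRECISION_HINT": "fp32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
}

# model = OVModelForVisualCausalLM.from_pretrained(
#     model_id, device="GPU.0", ov_config=ov_config
# )
```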
I tried it last night.

**tl;dr:** I got the int8 model (`Echo9Zulu/gemma-3-4b-it-int8_asym-ov`) running at 9 t/s by setting `INFERENCE_PRECISION_HINT` to FP32.
Then set this in the `ov_config`:

```json
"ov_config": {
    "NUM_STREAMS": "1",
    "INFERENCE_PRECISION_HINT": "fp32",
    "PERFORMANCE_HINT": "LATENCY"
}
```
In the OpenArc codebase, I renamed `PRECISION_HINT` to `INFERENCE_PRECISION_HINT` (locally; I didn't have time to put together a pull request).
When you set `do_sample=False`, the model simply takes the argmax (most likely token) at each step, bypassing the multinomial sampling that triggers the error. The model is still producing invalid probabilities; with this setting they just aren't being sampled (you get deterministic output from the same input), though it's possible you'll still hit one of them after a few messages. If you're fine with that, it's a good workaround for now.
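To make the mechanics concrete, here's a toy sketch (plain Python, not the actual transformers implementation) of why the greedy path sidesteps the validity check that multinomial sampling performs:

```python
import math
import random

def next_token(probs, do_sample=True):
    """Toy next-token pick mirroring the two decoding paths.

    do_sample=True mimics torch.multinomial, which rejects any
    distribution containing inf/nan/negative entries; do_sample=False
    mimics argmax, which never inspects validity at all.
    """
    if not do_sample:
        # Greedy path: take the index of the largest finite entry.
        return max(range(len(probs)),
                   key=lambda i: probs[i] if math.isfinite(probs[i]) else float("-inf"))
    if any(not math.isfinite(p) or p < 0 for p in probs):
        raise RuntimeError("probability tensor contains either inf, nan or element < 0")
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

broken = [0.7, float("nan"), 0.3]
print(next_token(broken, do_sample=False))  # 0 -- greedy ignores the nan
# next_token(broken, do_sample=True) raises RuntimeError, as in the traceback
```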
**Other info**
I couldn't get OpenVINO to run the int4 model (`Echo9Zulu/gemma-3-4b-it-int4_asym-ov`) at FP32 on the Arc A770. Error:

```json
{
"detail": "Exception from src/inference/src/cpp/core.cpp:109:\nException from src/inference/src/dev/plugin.cpp:53:\nCheck 'false' failed at src/plugins/intel_gpu/src/plugin/program_builder.cpp:191:\n[GPU] ProgramBuilder build failed!\nCheck 'shape_type == shape_types::dynamic_shape || node->selected_impl != nullptr' failed at src/plugins/intel_gpu/src/graph/graph_optimizer/compile_graph.cpp:53:\n[GPU] Failed to select implementation for\nname:convert:Convert_224667\ntype: reorder\noriginal_type: ConvertCheck '!kernels.empty()' failed at src/plugins/intel_gpu/src/kernel_selector/kernel_selector.cpp:70:\n[GPU] Could not find a suitable kernel for convert:Convert_224667 params raw string: UINT4_BFYX_v1_p0_0_v128_p0_0_v20_p0_0_v10240_p0_0;F16_BFYX_v1_p0_0_v128_p0_0_v20_p0_0_v10240_p0_0\n\n\n\n\n"
}
```
Looks like none of the kernels can run int4 weights at FP32 on the GPU.
And I couldn't get the int8 model (`Echo9Zulu/gemma-3-4b-it-int8_asym-ov`) to run at BF16, only FP32, due to this limitation:

```json
{
"detail": "Exception from src/inference/src/cpp/core.cpp:109:\nException from src/inference/src/dev/plugin.cpp:53:\nCheck 'property_validators.at(name)->is_valid(val)' failed at src/plugins/intel_gpu/src/runtime/execution_config.cpp:121:\n[GPU] Invalid value for property INFERENCE_PRECISION_HINT: `bf16`\n\n\n"
}
```
It looks like OpenVINO hasn't implemented BF16 on the GPU yet? The hardware supports it, so hopefully they'll add it soon.
@Echo9Zulu
After setting `do_sample=False`, there is still no output.
@Gapeleon
My test results are the same: int4 cannot execute. And surprisingly, the 12B int8 model occupies over 96 GB of memory.