Serving on vLLM creates nonsense responses
Your GPU reported that it does not support float16, and you then forced it to use float16 anyway, which is exactly what half means (half precision, 16 bits instead of float32's 32).
From the vLLM documentation:
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
“auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
“half” for FP16. Recommended for AWQ quantization.
“float16” is the same as “half”.
“bfloat16” for a balance between precision and range.
“float” is shorthand for FP32 precision.
“float32” for FP32 precision.
https://docs.vllm.ai/en/v0.4.0.post1/models/engine_args.html
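For reference, you can set the dtype explicitly instead of relying on “auto”. Here is a minimal sketch using the offline Python API (the model name below is only a placeholder, use whatever model you are serving):

from vllm import LLM, SamplingParams
# Force full-precision weights; use "bfloat16" or "half" only if the GPU supports them.
llm = LLM(model="facebook/opt-125m", dtype="float32")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)

The same --dtype value can be passed on the command line when launching the OpenAI-compatible server.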
Okay, but I don't understand why the LLM generates nonsense output.
Hi @cahmetcan,
Apologies for the late response. As you mentioned, if the GPU does not support the float16 data type and you still pass --dtype=half
(which is just float16) when running the model, the weight and activation values can be misinterpreted or overflow because of the data type mismatch. This leads to gibberish output. Please let me know if you require any further assistance.
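If you want to confirm what your GPU actually supports before choosing a dtype, here is a minimal check with PyTorch (illustrative only, not part of vLLM itself):

import torch
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
# bfloat16 needs compute capability 8.0 (Ampere) or newer.
print("bfloat16 supported:", torch.cuda.is_bf16_supported())
# If vLLM warns that float16 is unsupported on your card, float32 is the safe fallback.
dtype = "bfloat16" if torch.cuda.is_bf16_supported() else "float32"
print("Suggested --dtype:", dtype)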
Thanks.