Serving on vLLM creates nonsense responses
Your GPU reported that it does not support float16, and you then forced it to use float16 anyway, which is exactly what half means (half precision, 16 bits instead of float32's 32).
From the vLLM documentation:
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations.
“auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
“half” for FP16. Recommended for AWQ quantization.
“float16” is the same as “half”.
“bfloat16” for a balance between precision and range.
“float” is shorthand for FP32 precision.
“float32” for FP32 precision.
https://docs.vllm.ai/en/v0.4.0.post1/models/engine_args.html
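For reference, you can set the dtype explicitly instead of relying on “auto”. Here is a minimal sketch using the offline Python API (the model name below is only a placeholder, use whatever model you are serving):

from vllm import LLM, SamplingParams
# Force full-precision weights; use "bfloat16" or "half" only if the GPU supports them.
llm = LLM(model="facebook/opt-125m", dtype="float32")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)

The same --dtype value can be passed on the command line when launching the OpenAI-compatible server.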
Okay, but I don't understand why the LLM generates nonsense output.
Hi @cahmetcan,
Apologies for the late response. As you mentioned, if the GPU does not support the float16 data type and you still pass --dtype=half
(which is just float16) when running the model, the weight and activation values can be misinterpreted or overflow because of the data type mismatch. This leads to gibberish output. Please let me know if you require any further assistance.
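If you want to confirm what your GPU actually supports before choosing a dtype, here is a minimal check with PyTorch (illustrative only, not part of vLLM itself):

import torch
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
# bfloat16 needs compute capability 8.0 (Ampere) or newer.
print("bfloat16 supported:", torch.cuda.is_bf16_supported())
# If vLLM warns that float16 is unsupported on your card, float32 is the safe fallback.
dtype = "bfloat16" if torch.cuda.is_bf16_supported() else "float32"
print("Suggested --dtype:", dtype)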
Thanks.