Remove vLLM FP8 Limitation
This has been fixed as of the latest v0.8.5 release.
ERROR 04-29 09:46:24 [core.py:396] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
I got this when running it on an A100. Does it not use the Marlin kernels by default?
I'm still encountering this error on 0.8.5.
I'm using 2x 3090s with -tp 2, if that makes a difference.
I'm also still encountering this issue on vLLM version 0.8.5.post1
Model: Qwen/Qwen3-30B-A3B-FP8
Running in WSL Ubuntu, 2x RTX 3090 gpus
Command:
vllm serve /mnt/d/models/Qwen3-30B-A3B-FP8 \
    --enable-reasoning \
    --reasoning-parser deepseek_r1 \
    --quantization fp8 \
    --enforce-eager \
    --max-model-len 10000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.98 \
    --served-model-name localmodel \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 5111
Error:
RuntimeError: Worker failed with error 'at 1:0:
def _per_token_group_quant_fp8(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")', please check the stack trace above for the root cause
This is W8A8, which needs Hopper, Ada Lovelace, or later cards. I don't think 3090s (Ampere) can run this.
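For reference, a quick way to check this (my own sketch, using PyTorch): native FP8 tensor-core kernels need compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper), while Ampere cards like the RTX 3090 report 8.6 and the A100 reports 8.0.

```python
# Sketch: check whether the current GPU can run native W8A8 FP8 kernels.
# Ada Lovelace is SM 8.9 and Hopper is SM 9.0; Ampere cards (RTX 3090 = SM 8.6,
# A100 = SM 8.0) need a W8A16 fallback such as FP8 Marlin instead.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if (major, minor) >= (8, 9):
    print("Native FP8 (W8A8) kernels should be available.")
else:
    print("No native FP8 support; vLLM would need a W8A16 fallback such as FP8 Marlin.")
```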
vLLM's Marlin kernel should allow you to run FP8 models as W8A16 on Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.html
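As an illustration (a sketch, not from this thread), a plain per-tensor FP8 checkpoint without MoE or block-wise quantization should load on Ampere through that W8A16 Marlin fallback; the model name below is only illustrative.

```python
# Sketch: serve a non-MoE, per-tensor FP8 checkpoint on an Ampere GPU.
# With no native FP8 support, vLLM should fall back to the weight-only
# FP8 Marlin (W8A16) kernel for a model like this.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",  # illustrative FP8 checkpoint
    max_model_len=4096,
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain FP8 Marlin in one sentence."], params)
print(outputs[0].outputs[0].text)
```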
FP8 Marlin didn't support block-wise FP8 quantization or MoE until https://github.com/vllm-project/vllm/pull/16850, which is not included in 0.8.5.post1. While 0.9.0 includes that PR, there are no prebuilt binary packages yet, so for now Ampere cards cannot run this model.
Ah I see, thanks very much!
So, will vLLM 0.9.0 support block-wise FP8 quant and MoE?
Yes, it does. If you run into any issues, please consider reporting them to vLLM or Qwen on GitHub.
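One small sanity check before retrying on Ampere (my own sketch): confirm the installed vLLM is at least 0.9.0, since that is the first release carrying the FP8 Marlin block-wise/MoE support from the PR above.

```python
# Sketch: verify the installed vLLM version before trying block-wise FP8 MoE on Ampere.
from importlib.metadata import version
from packaging.version import Version

installed = Version(version("vllm"))
if installed < Version("0.9.0"):
    print(f"vLLM {installed} predates FP8 Marlin block-wise/MoE support; upgrade to >= 0.9.0.")
else:
    print(f"vLLM {installed} should include the FP8 Marlin block-wise/MoE support.")
```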