Remove vLLM FP8 Limitation #2
Qwen org

This has been fixed as of the latest v0.8.5 release 🙇
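For reference, upgrading should just be the following, assuming a standard pip-based install:

# Upgrade vLLM to a version that includes the fix
pip install -U "vllm>=0.8.5"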

ERROR 04-29 09:46:24 [core.py:396] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

I got this when running it on an A100. Does it not use the Marlin kernels by default?

jklj077 changed pull request status to merged

I'm still encountering this error on 0.8.5.
I'm using two 3090s with -tp 2, if that makes a difference.

I'm also still encountering this issue on vLLM version 0.8.5.post1

Model: Qwen/Qwen3-30B-A3B-FP8

Running in WSL Ubuntu with 2x RTX 3090 GPUs.

Command:
vllm serve /mnt/d/models/Qwen3-30B-A3B-FP8 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --quantization fp8 \
  --enforce-eager \
  --max-model-len 10000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.98 \
  --served-model-name localmodel \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --port 5111

Error:
RuntimeError: Worker failed with error 'at 1:0:
def _per_token_group_quant_fp8(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")', please check the stack trace above for the root cause

Qwen org

This is W8A8, which needs Hopper, Ada Lovelace, or later cards. I don't think 3090s (Ampere) can run this.

vLLM's Marlin kernel should allow you to run FP8 models as W8A16 on Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.html
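A quick way to check which architecture your cards are, assuming PyTorch is installed (native FP8 W8A8 needs compute capability 8.9 Ada or 9.0 Hopper; 3090s report 8.6):

# Print the CUDA compute capability of each visible GPU
python -c "import torch; print([torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())])"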

Qwen org

vLLM's Marlin kernel should allow you to run FP8 models as W8A16 on Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.html

FP8 Marlin didn't support block-wise FP8 quantization or MoE until https://github.com/vllm-project/vllm/pull/16850, which is not available in 0.8.5.post1. While 0.9.0 includes that PR, there are no prebuilt binary packages for it yet, so for now Ampere cards cannot run this.
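If you want to try it before a 0.9.0 wheel is published, one option is building vLLM from source so your tree includes that PR (a rough sketch; see the vLLM installation docs for the exact steps):

# Clone vLLM and install from source (builds the CUDA kernels locally, which can take a while)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .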

Ah I see, thanks very much!

So, will vLLM 0.9.0 support block-wise FP8 quant and MoE?

Qwen org

So, will vLLM 0.9.0 support block-wise FP8 quant and MoE?

Yes, it does. If you run into any issues, please consider reporting them to vLLM or Qwen on GitHub.
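Once the server is up, a quick sanity check against vLLM's OpenAI-compatible endpoint (using the port and served model name from the command above):

# Send one chat request to the vLLM server started with --port 5111 and --served-model-name localmodel
curl http://localhost:5111/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "localmodel", "messages": [{"role": "user", "content": "Hello"}]}'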
