Remove vLLM FP8 Limitation
This has been fixed as of the latest v0.8.5 release.
ERROR 04-29 09:46:24 [core.py:396] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
I got this when running it on an A100. Does it not use the Marlin kernels by default?
I'm still encountering this error on 0.8.5.
I'm using 2x 3090s with -tp 2, if that makes a difference.
I'm also still encountering this issue on vLLM version 0.8.5.post1
Model: Qwen/Qwen3-30B-A3B-FP8
Running in WSL Ubuntu, 2x RTX 3090 gpus
Command:
vllm serve /mnt/d/models/Qwen3-30B-A3B-FP8 \
    --enable-reasoning \
    --reasoning-parser deepseek_r1 \
    --quantization fp8 \
    --enforce-eager \
    --max-model-len 10000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.98 \
    --served-model-name localmodel \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 5111
Error:
RuntimeError: Worker failed with error 'at 1:0:
def _per_token_group_quant_fp8(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")', please check the stack trace above for the root cause
This is W8A8, which needs Hopper, Ada Lovelace, or later cards. I don't think 3090s (Ampere) can run this.
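For reference, a quick way to check this (my own sketch, using PyTorch): native FP8 tensor-core kernels need compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper), while Ampere cards like the RTX 3090 report 8.6 and the A100 reports 8.0.

```python
# Sketch: check whether the current GPU can run native W8A8 FP8 kernels.
# Ada Lovelace is SM 8.9 and Hopper is SM 9.0; Ampere cards (RTX 3090 = SM 8.6,
# A100 = SM 8.0) need a W8A16 fallback such as FP8 Marlin instead.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if (major, minor) >= (8, 9):
    print("Native FP8 (W8A8) kernels should be available.")
else:
    print("No native FP8 support; vLLM would need a W8A16 fallback such as FP8 Marlin.")
```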
vLLM's Marlin kernel should allow you to run FP8 models as W8A16 on Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.html
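As an illustration (a sketch, not from this thread), a plain per-tensor FP8 checkpoint without MoE or block-wise quantization should load on Ampere through that W8A16 Marlin fallback; the model name below is only illustrative.

```python
# Sketch: serve a non-MoE, per-tensor FP8 checkpoint on an Ampere GPU.
# With no native FP8 support, vLLM should fall back to the weight-only
# FP8 Marlin (W8A16) kernel for a model like this.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",  # illustrative FP8 checkpoint
    max_model_len=4096,
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain FP8 Marlin in one sentence."], params)
print(outputs[0].outputs[0].text)
```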
FP8 Marlin didn't support block-wise FP8 quantization or MoE until https://github.com/vllm-project/vllm/pull/16850, which is not included in 0.8.5.post1. While 0.9.0 includes that PR, there are no prebuilt binary packages yet, so for now Ampere cards cannot run this model.
Ah I see, thanks very much!
So, will vLLM 0.9.0 support block-wise FP8 quant and MoE?
Yes, it does. If you run into any issues, please consider reporting them to vLLM or Qwen on GitHub.
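One small sanity check before retrying on Ampere (my own sketch): confirm the installed vLLM is at least 0.9.0, since that is the first release carrying the FP8 Marlin block-wise/MoE support from the PR above.

```python
# Sketch: verify the installed vLLM version before trying block-wise FP8 MoE on Ampere.
from importlib.metadata import version
from packaging.version import Version

installed = Version(version("vllm"))
if installed < Version("0.9.0"):
    print(f"vLLM {installed} predates FP8 Marlin block-wise/MoE support; upgrade to >= 0.9.0.")
else:
    print(f"vLLM {installed} should include the FP8 Marlin block-wise/MoE support.")
```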