KeyError when loading Llama-3.1-405B-Instruct-FP8 with vLLM v0.7.3

#2
by xihajun - opened

When attempting to load the Llama-3.1-405B-Instruct-FP8 model using vLLM v0.7.3, all worker processes fail with the same KeyError:

KeyError: 'layers.7.mlp.down_proj.input_scale'

The error occurs during the model loading phase when the weight loader is trying to find a specific parameter that appears to be missing from the model weights or is named differently than expected.
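To narrow this down, it can help to inspect how the scale tensors are actually named in the checkpoint. A minimal sketch using the safetensors library (the local directory name is a placeholder for wherever the model shards were downloaded):

    import glob
    from safetensors import safe_open

    # Scan every checkpoint shard and print the scale tensors attached to
    # down_proj, to see whether input_scale exists and how it is named.
    for shard in sorted(glob.glob("Llama-3.1-405B-Instruct-FP8/*.safetensors")):
        with safe_open(shard, framework="pt") as f:
            for name in f.keys():
                if "down_proj" in name and "scale" in name:
                    print(shard, name)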

Environment:

  • vLLM version: v0.7.3
  • Model: Llama-3.1-405B-Instruct-FP8
  • Hardware: 8× NVIDIA H200 GPUs, tensor parallelism across all 8

Steps to reproduce:

  1. Run the vLLM server with the Llama-3.1-405B-Instruct-FP8 model (roughly as in the command sketch below)
  2. Observe that the worker processes fail while loading the model weights
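For reference, the failing launch looks roughly like this (the exact flags are an assumption based on the environment above; the key point is that no --quantization option is passed):

$ vllm serve nvidia/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8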

This appears to be a mismatch between the FP8 quantization format and the vLLM weight-loading mechanism: the parameter naming or structure that vLLM expects does not match what is stored in the model weights.

Is there an additional parameter needed for FP8-quantized models, or is a specific vLLM version required for Llama-3.1 models with FP8 quantization?

You may need to pass quantization=modelopt instead of leaving it as the default (None)?

I faced the same problem, and quantization=modelopt fixed it. Thanks.

$ vllm serve nvidia/Llama-3.1-405B-Instruct-FP8 --quantization modelopt ...
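For completeness, if you use the offline Python API rather than the server, the same fix is to pass quantization="modelopt" to the LLM constructor. A minimal sketch (the prompt and sampling settings are only illustrative):

    from vllm import LLM, SamplingParams

    # Explicitly tell vLLM that the checkpoint uses ModelOpt FP8 quantization
    # rather than relying on automatic detection.
    llm = LLM(
        model="nvidia/Llama-3.1-405B-Instruct-FP8",
        quantization="modelopt",
        tensor_parallel_size=8,
    )

    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)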
