KeyError when loading Llama-3.1-405B-Instruct-FP8 with vLLM v0.7.3

#2
by xihajun - opened

When attempting to load the Llama-3.1-405B-Instruct-FP8 model using vLLM v0.7.3, all worker processes fail with the same KeyError:

KeyError: 'layers.7.mlp.down_proj.input_scale'

The error occurs during the model loading phase when the weight loader is trying to find a specific parameter that appears to be missing from the model weights or is named differently than expected.
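To narrow this down, it can help to inspect how the scale tensors are actually named in the checkpoint. A minimal sketch using the safetensors library (the local directory name is a placeholder for wherever the model shards were downloaded):

    import glob
    from safetensors import safe_open

    # Scan every checkpoint shard and print the scale tensors attached to
    # down_proj, to see whether input_scale exists and how it is named.
    for shard in sorted(glob.glob("Llama-3.1-405B-Instruct-FP8/*.safetensors")):
        with safe_open(shard, framework="pt") as f:
            for name in f.keys():
                if "down_proj" in name and "scale" in name:
                    print(shard, name)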

Environment:

  • vLLM version: v0.7.3
  • Model: Llama-3.1-405B-Instruct-FP8
  • Hardware: 8× NVIDIA H200 GPUs, tensor parallelism across all 8

Steps to reproduce:

  1. Run the vLLM server with the Llama-3.1-405B-Instruct-FP8 model (roughly as in the command sketch below)
  2. Observe that the worker processes fail while loading the model weights
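For reference, the failing launch looks roughly like this (the exact flags are an assumption based on the environment above; the key point is that no --quantization option is passed):

$ vllm serve nvidia/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8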

This appears to be a mismatch between the FP8 quantization format and the vLLM weight-loading mechanism: the parameter naming or structure that vLLM expects does not match what is stored in the model weights.

Is there an additional parameter needed for FP8-quantized models, or is a specific vLLM version required for Llama-3.1 models with FP8 quantization?

You may need to pass quantization=modelopt instead of leaving it as the default (None)?

I faced the same problem, and quantization=modelopt fixed it. Thanks.

$ vllm serve nvidia/Llama-3.1-405B-Instruct-FP8 --quantization modelopt ...
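For completeness, if you use the offline Python API rather than the server, the same fix is to pass quantization="modelopt" to the LLM constructor. A minimal sketch (the prompt and sampling settings are only illustrative):

    from vllm import LLM, SamplingParams

    # Explicitly tell vLLM that the checkpoint uses ModelOpt FP8 quantization
    # rather than relying on automatic detection.
    llm = LLM(
        model="nvidia/Llama-3.1-405B-Instruct-FP8",
        quantization="modelopt",
        tensor_parallel_size=8,
    )

    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)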
