vLLM not loading meta-llama/Llama-4-Scout-17B-16E-Instruct
meta-llama/Llama-4-Scout-17B-16E-Instruct
The above model does not load correctly; however, meta-llama/Llama-4-Scout-17B-16E works fine.
See the error logs below. The core errors are:
Assertion '-sizes[i] <= index && index < sizes[i] && "index out of bounds"' failed.
RuntimeError: CUDA error: device-side assert triggered
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:133, unhandled cuda error
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CUDA error: device-side assert triggered
These errors point to an out-of-bounds index during a tensor operation, which cascades into CUDA device errors and ultimately brings down the distributed (NCCL) workers.
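For illustration only, this failure class is easy to reproduce in isolation. The following is a minimal sketch, not the actual failing op inside vLLM: an out-of-bounds index passed to a CUDA indexing kernel trips the device-side assert, after which every later CUDA call in the process fails.

import torch

x = torch.zeros(4, device="cuda")
bad_idx = torch.tensor([10], device="cuda")  # 10 is out of bounds for a size-4 tensor
y = x[bad_idx]                               # indexing kernel launches; the assert fires on the device
torch.cuda.synchronize()                     # the asynchronous error typically surfaces at the next sync

Running with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python traceback points at the launch that actually failed rather than a later, unrelated op.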
I was able to load the model and run inference with the following command in vLLM:
VLLM_DISABLE_COMPILE_CACHE=1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --host 0.0.0.0 \
  --port 8085 \
  --max-model-len 8192 \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --served-model-name llama4_scout_inst \
  --override-generation-config='{"attn_temperature_tuning": true}'
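Once the server is up, a quick smoke test can be run against the OpenAI-compatible endpoint it exposes; a minimal sketch, assuming the port 8085 and served model name llama4_scout_inst from the command above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8085/v1", api_key="EMPTY")  # vLLM accepts any key unless --api-key is set

resp = client.chat.completions.create(
    model="llama4_scout_inst",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)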
My config and hardware details are in https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/discussions/57
Loading with Hugging Face Transformers is still failing for me.
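For reference, a minimal sketch of the Transformers load path I am attempting (the model class and kwargs here are my assumptions based on the model card, not a verified recipe, since this path has not worked for me yet):

import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # same dtype as the working vLLM run
    device_map="auto",           # shard across the available GPUs
)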