How much GPU memory does the AWQ quant need?
Thanks for your wonderful work!
That depends on how much context you want it to support and how many concurrent users you'd like to serve.
If you only want to test with a handful of users, roughly 4x 64 GB should be enough.
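For a rough back-of-envelope, this is how I'd size the weight footprint. Sketch only: the ~355B total-parameter count for GLM-4.5 and the ~4.25 effective bits per weight for 4-bit AWQ (group scales/zeros included) are my assumptions, and KV cache plus runtime overhead come on top of this.

```python
# Back-of-envelope VRAM estimate for an AWQ-quantized checkpoint.
# Assumptions (not from this thread): ~355B total params for GLM-4.5,
# ~4.25 effective bits/weight for 4-bit AWQ with group scales/zeros.

def awq_weight_gib(total_params_b: float, bits_per_weight: float = 4.25) -> float:
    """Approximate weight memory in GiB for a quantized checkpoint."""
    bytes_total = total_params_b * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

weights = awq_weight_gib(355)          # ~175 GiB for the weights alone
print(f"weights ~= {weights:.0f} GiB")
# KV cache, activations, and CUDA-graph overhead are extra, which is why
# something like 4x 64 GB or 5x 48 GB is about the practical floor.
```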
Thoughts on 2x RTX 6000 Blackwell? Different quant?
It does fit the layers, but vLLM dies after compilation no matter what settings I try.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x98 (0x749bfa9785e8 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xe0 (0x749bfa90d4a2 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x749bfadc6422 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x749b8fee55a6 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x749b8fef5840 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x749b8fef73d2 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x749b8fef8fdd in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xf2324 (0x749b7fef2324 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0xa27f1 (0x749bfb8a27f1 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x133c9c (0x749bfb933c9c in /lib/x86_64-linux-gnu/libc.so.6)
It dies as soon as it's prompted.
I think five 48 GB GPUs are probably the minimum. I am able to run it like this on 5x L40S with the public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:latest vLLM Docker image:
OMP_NUM_THREADS=16 VLLM_PP_LAYER_PARTITION=20,18,18,18,18 vllm serve path/to/GLM-4.5-AWQ \
  --max-model-len 22000 --max-seq-len-to-capture 22000 \
  --enable-expert-parallel --swap-space 16 \
  --tensor-parallel-size 1 --pipeline-parallel-size 5 \
  --gpu-memory-utilization 0.99 --served-model-name glm-4.5 \
  --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice \
  --kv-cache-dtype fp8 --calculate-kv-scales
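Once it's up, you can smoke-test it through the OpenAI-compatible endpoint vLLM exposes. The sketch below assumes the default port 8000 and the --served-model-name from the command above; adjust the base URL to your setup.

```python
# Minimal smoke test against the vLLM OpenAI-compatible server started above.
# Assumes the default port 8000 and served model name "glm-4.5".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm-4.5",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```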