How much GPU memory does the AWQ quant need?
Thanks for your wonderful work!
That depends on how much context you want it to support and how many concurrent users you'd like to serve.
If you only want to test with a handful of users, roughly 4x 64 GB should be enough.
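For a rough back-of-envelope, this is how I'd size the weight footprint. Sketch only: the ~355B total-parameter count for GLM-4.5 and the ~4.25 effective bits per weight for 4-bit AWQ (group scales/zeros included) are my assumptions, and KV cache plus runtime overhead come on top of this.

```python
# Back-of-envelope VRAM estimate for an AWQ-quantized checkpoint.
# Assumptions (not from this thread): ~355B total params for GLM-4.5,
# ~4.25 effective bits/weight for 4-bit AWQ with group scales/zeros.

def awq_weight_gib(total_params_b: float, bits_per_weight: float = 4.25) -> float:
    """Approximate weight memory in GiB for a quantized checkpoint."""
    bytes_total = total_params_b * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

weights = awq_weight_gib(355)          # ~175 GiB for the weights alone
print(f"weights ~= {weights:.0f} GiB")
# KV cache, activations, and CUDA-graph overhead are extra, which is why
# something like 4x 64 GB or 5x 48 GB is about the practical floor.
```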
Thoughts on 2x RTX 6000 Blackwell? Different quant?
It does fit the layers, but vLLM dies after compilation no matter what settings I try.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x98 (0x749bfa9785e8 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xe0 (0x749bfa90d4a2 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x749bfadc6422 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x749b8fee55a6 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x749b8fef5840 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x749b8fef73d2 in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x749b8fef8fdd in /home/giga/.pyenv/versions/3.12.11/envs/vllm10/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xf2324 (0x749b7fef2324 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0xa27f1 (0x749bfb8a27f1 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x133c9c (0x749bfb933c9c in /lib/x86_64-linux-gnu/libc.so.6)
It dies as soon as it's prompted.
I think five 48 GB GPUs are probably the minimum. I am able to run it like this on 5x L40S with the public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:latest vLLM Docker image:
OMP_NUM_THREADS=16 VLLM_PP_LAYER_PARTITION=20,18,18,18,18 vllm serve path/to/GLM-4.5-AWQ \
  --max-model-len 22000 --max-seq-len-to-capture 22000 \
  --enable-expert-parallel --swap-space 16 \
  --tensor-parallel-size 1 --pipeline-parallel-size 5 \
  --gpu-memory-utilization 0.99 --served-model-name glm-4.5 \
  --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice \
  --kv-cache-dtype fp8 --calculate-kv-scales
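Once it's up, you can smoke-test it through the OpenAI-compatible endpoint vLLM exposes. The sketch below assumes the default port 8000 and the --served-model-name from the command above; adjust the base URL to your setup.

```python
# Minimal smoke test against the vLLM OpenAI-compatible server started above.
# Assumes the default port 8000 and served model name "glm-4.5".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm-4.5",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```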