Hardware specs for running on an RTX Pro 6000
Trying to get this running on an RTX Pro 6000 with 96GB VRAM, but I'm getting OOM.
vllm serve NousResearch/Hermes-4-70B
I tried a few other parameter combinations, e.g.:
--dtype auto --kv-cache-dtype fp8 --gpu-memory-utilization 0.95 --max-model-len 32768 etc.
The error returned:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 94.97 GiB of which 206.88 MiB is free. Including non-PyTorch memory, this process has 94.76 GiB memory in use. Of the allocated memory 94.12 GiB is allocated by PyTorch, and 1.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Is it possible to run Hermes 4 70B on an RTX Pro 6000?
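For reference, the fragmentation workaround the traceback suggests is set as an environment variable; here's a minimal sketch reusing the same flags as above. Note it only mitigates allocator fragmentation and cannot free memory the weights themselves require:

```bash
# Apply the allocator hint from the error message; this reduces fragmentation
# but cannot help when the model weights alone exceed available VRAM.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm serve NousResearch/Hermes-4-70B \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768
```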
You should run the FP8 version with vLLM; in bf16, the model requires ~140GB of memory.
It's available here, and no special flags are required to run it (though you can still use all the ones you already are):
https://huggingface.co/NousResearch/Hermes-4-70B-FP8
Thank you so much! How much GPU memory is needed for the FP8 version? 48/64/96GB?
You would need around 70GB to load it, plus at least 5 more GB for the context window.
If you use the 4-bit GGUF, you only need ~35GB, but it reduces quality a small bit.
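(These figures follow from a rough rule of thumb: weight memory ≈ parameter count × bytes per parameter. For a 70B model that's 70B × 2 bytes ≈ 140GB in bf16, 70B × 1 byte ≈ 70GB in FP8, and 70B × 0.5 bytes ≈ 35GB at 4-bit, plus a few GB of KV cache depending on context length.)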