hardware specs for running RTX Pro 6000

#6
by anuragphadke - opened

Trying to get this running on an RTX Pro 6000 with 96GB VRAM, but getting OOM.

vllm serve NousResearch/Hermes-4-70B

Tried a few other parameter combinations:
--dtype auto --kv-cache-dtype fp8 --gpu-memory-utilization 0.95 --max-model-len 32768 etc.

error returned:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 94.97 GiB of which 206.88 MiB is free. Including non-PyTorch memory, this process has 94.76 GiB memory in use. Of the allocated memory 94.12 GiB is allocated by PyTorch, and 1.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Is it possible to run Hermes 4 70B on an RTX Pro 6000?

anuragphadke changed discussion title from hardware specs for running on consumer GPU to hardware specs for running RTX Pro 6000
NousResearch org

You should run the FP8 version with vLLM. In bf16, it requires 140GB of memory.

teknium changed discussion status to closed
NousResearch org

That's available here, and no special flags are required to run it (but you can still use all the ones you already are):
https://huggingface.co/NousResearch/Hermes-4-70B-FP8
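For reference, a single-GPU launch along these lines should fit within 96GB (a sketch reusing the flags from the original post; the exact values are illustrative, not prescribed by this thread):

vllm serve NousResearch/Hermes-4-70B-FP8 --max-model-len 32768 --gpu-memory-utilization 0.95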

Thank you so much; how much GPU memory is needed for the FP8 version? 48/64/96GB?

NousResearch org

You would need around 35GB to load it + at least 5 more GB for the context window

NousResearch org

Err, sorry: 70GB to load + ~5GB for context*
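(Rough arithmetic for context, my own estimate rather than a figure from the replies: 70B parameters at 1 byte each in FP8 is about 70GB of weights, leaving roughly 20-25GB of a 96GB card for the KV cache and activations.)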

NousResearch org

If you use the 4-bit GGUF, only ~35GB, but it reduces quality a little.
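If you go that route, a minimal llama.cpp launch might look like this (a sketch assuming llama-server as the runtime, which the thread does not specify; the GGUF filename is hypothetical):

llama-server -m Hermes-4-70B-Q4_K_M.gguf --n-gpu-layers 99 -c 32768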
