How much GPU memory is needed for local deployment?
+1
+1
No, I would say at least 160 GB, not 1 GB.
Why 160 GB? It only has 3B active params, which suggests it might even work well on a low-end GPU, or even a CPU, with very fast storage.
Yes, but you still need to load the entire model (80B parameters) into GPU memory to run it. At inference time only 3B parameters are active, so you save a lot of GPU compute: you can handle more requests and process them faster than if all 80B parameters were activated. But you still need the whole model in VRAM. Without quantization the model is about 160 GB (80B parameters × 2 bytes per bf16 weight), so you need at least 160 GB of VRAM, and in practice more, depending on the context size you want to handle.
With GGUF you can offload part of the MoE experts to RAM. This is slower, of course, but requires less VRAM.
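For example, with a llama.cpp build that supports this model, you can keep the shared layers on the GPU and push the per-expert FFN tensors to system RAM. A minimal sketch, assuming a hypothetical Q4_K_M GGUF file name and the --override-tensor (-ot) buffer-override flag (check the flags available in your llama.cpp version):

# Keep all layers on the GPU (-ngl 99), then force the MoE expert tensors onto CPU/RAM
llama-server -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
-ngl 99 \
-ot ".ffn_.*_exps.=CPU" \
-c 32768 --host 0.0.0.0 --port 8081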
Yes, I used 2 x A800 (H100) GPUs with vLLM and the param --gpu-memory-utilization 0.96, or it may crash at startup. However, during inference it still crashes if the input is long.
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_CUMEM_ENABLE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
uv run vllm serve llm-models/Qwen3-Next-80B-A3B-Instruct --served-model-name Qwen3-Next-80B-A3B \
--host 0.0.0.0 --port 10004 --tensor_parallel_size 2 --max_model_len 32768 \
--gpu-memory-utilization 0.96 --enable-auto-tool-choice --tool-call-parser hermes --enable-log-requests \
--max_num_seqs 1 --swap-space 64 --max_num_batched_tokens 4096
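Once it is up, you can sanity-check it through the OpenAI-compatible endpoint that vLLM exposes; the model name must match --served-model-name above:

curl http://localhost:10004/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen3-Next-80B-A3B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'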
If you quantize it far enough, you can even run it on a single 32 GB card (80B parameters at roughly 3 bits per weight comes to around 30 GB). But this is the sort of model for CPU users: with only 3B active parameters, even a relatively mediocre CPU would do. Something remarkable like a Strix Halo (Ryzen AI Max+ 395) or a Mac's CPU would make this shine.
May I ask which version of vLLM you used?
It would be great if Qwen released an official quantized version of the Qwen3-Next model, so that users could fit it onto 2 x H100 GPUs and handle large-context requests.
How much GPU memory is needed for local deployment?
4 x H20 96 GB; the run command is as below:
vllm serve /root/.cache/huggingface/Qwen3-Next-80B-A3B-Instruct \
--tokenizer-mode auto \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-seqs 16 \
--max-num-batched-tokens 4096 \
--host 0.0.0.0 --port 8080 \
--served-model-name Qwen3-Next-80B-A3B-Instruct
You can run the official FP8 version on a single H100. It takes around 76GB of VRAM just to load the model. Then, keep some memory for the KV cache (10GB of VRAM to process 2 queries with 128k context). Also, reserve at least 5GB of VRAM for extra overhead, or it will crash at startup. I had to reduce GPU memory utilization to 0.9 for it to work (0.95 causes a crash for me).
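For reference, a minimal launch sketch under those settings, assuming the official FP8 checkpoint is published as Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (adjust the repo name, context length, and port to your setup):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--host 0.0.0.0 --port 8080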