How much GPU memory is needed for local deployment?
+1
+1
No, I would say at least 160 GB, not 1 GB.
Why 160 GB? It only has 3B active params, which suggests it might even work well on a low-end GPU, or even a CPU, with very fast storage.
Yes, but you still need to load the entire model (80B parameters) into GPU memory to run it. At inference time only 3B parameters are active, so you save a lot of GPU compute: you can handle more requests and process them faster than if all 80B parameters were activated. But you still need the whole model in VRAM. Without quantization the model is about 160 GB (80B parameters × 2 bytes per bf16 weight), so you need at least 160 GB of VRAM, and in practice more, depending on the context size you want to handle.
With GGUF you can offload part of the MoE experts to RAM. This is slower, of course, but requires less VRAM.
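For example, with a llama.cpp build that supports this model, you can keep the shared layers on the GPU and push the per-expert FFN tensors to system RAM. A minimal sketch, assuming a hypothetical Q4_K_M GGUF file name and the --override-tensor (-ot) buffer-override flag (check the flags available in your llama.cpp version):

# Keep all layers on the GPU (-ngl 99), then force the MoE expert tensors onto CPU/RAM
llama-server -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
-ngl 99 \
-ot ".ffn_.*_exps.=CPU" \
-c 32768 --host 0.0.0.0 --port 8081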
Yes, I used 2 x A800 (H100) GPUs with vLLM and the param --gpu-memory-utilization 0.96, or it may crash at startup. However, during inference it still crashes if the input is long.
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_CUMEM_ENABLE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
uv run vllm serve llm-models/Qwen3-Next-80B-A3B-Instruct --served-model-name Qwen3-Next-80B-A3B \
--host 0.0.0.0 --port 10004 --tensor_parallel_size 2 --max_model_len 32768 \
--gpu-memory-utilization 0.96 --enable-auto-tool-choice --tool-call-parser hermes --enable-log-requests \
--max_num_seqs 1 --swap-space 64 --max_num_batched_tokens 4096
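Once it is up, you can sanity-check it through the OpenAI-compatible endpoint that vLLM exposes; the model name must match --served-model-name above:

curl http://localhost:10004/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen3-Next-80B-A3B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'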
If you quantize it far enough, you can even run it on a single 32 GB card (80B parameters at roughly 3 bits per weight comes to around 30 GB). But this is the sort of model for CPU users: with only 3B active parameters, even a relatively mediocre CPU would do. Something remarkable like a Strix Halo (Ryzen AI Max+ 395) or a Mac's CPU would make this shine.
May I ask which version of vLLM you used?
It would be great if Qwen released an official quantized version of the Qwen3-Next model, so that users could fit it onto 2 x H100 GPUs and handle large-context requests.
How much GPU memory is needed for local deployment?
4 x H20 96 GB; the run command is as below:
vllm serve /root/.cache/huggingface/Qwen3-Next-80B-A3B-Instruct \
--tokenizer-mode auto \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-seqs 16 \
--max-num-batched-tokens 4096 \
--host 0.0.0.0 --port 8080 \
--served-model-name Qwen3-Next-80B-A3B-Instruct
You can run the official FP8 version on a single H100. It takes around 76GB of VRAM just to load the model. Then, keep some memory for the KV cache (10GB of VRAM to process 2 queries with 128k context). Also, reserve at least 5GB of VRAM for extra overhead, or it will crash at startup. I had to reduce GPU memory utilization to 0.9 for it to work (0.95 causes a crash for me).
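For reference, a minimal launch sketch under those settings, assuming the official FP8 checkpoint is published as Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (adjust the repo name, context length, and port to your setup):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--host 0.0.0.0 --port 8080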