Why is Qwen3-30B-A3B-FP8 even slower than QWQ-32B-AWQ?
I am hosting the model with vLLM in Docker on an AWS g6.12xlarge instance (4x NVIDIA L4 GPUs) with the following command:
"command": [
"--model",
"Qwen/Qwen3-30B-A3B-FP8",
"--gpu-memory-utilization",
"0.95",
"--max-model-len",
"32000",
"--num-scheduler-steps",
"10",
"--quantization",
"fp8",
"--enforce-eager",
"--enable-expert-parallel",
"--tensor-parallel-size",
"4"
],
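For reference, since this fragment is just the container command override, the same setup should be reproducible with a plain docker run against the stock vllm/vllm-openai image. This is a sketch; the image tag, port mapping, cache mount, and GPU flags below are my assumptions, not part of the original deployment:

# Equivalent standalone launch (assumed image tag and host settings)
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-30B-A3B-FP8 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32000 \
    --num-scheduler-steps 10 \
    --quantization fp8 \
    --enforce-eager \
    --enable-expert-parallel \
    --tensor-parallel-size 4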
I noticed the model's average generation throughput is only about 13 tokens/s, while QWQ-32B-AWQ, hosted on the same instance type, reaches about 25 tokens/s. Shouldn't the MoE model, with only ~3B active parameters per token, theoretically be faster than the dense 32B model?
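For context, this is roughly how I measure throughput: one non-streaming request against the OpenAI-compatible endpoint, dividing the completion tokens reported in the usage field by wall-clock time. A minimal sketch; the host/port and prompt are placeholders, and it assumes curl, jq, and bc are available:

# Time a single completion request and compute tokens/s from the reported usage
start=$(date +%s.%N)
resp=$(curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-30B-A3B-FP8",
         "prompt": "Explain tensor parallelism in one paragraph.",
         "max_tokens": 512}')
end=$(date +%s.%N)
tokens=$(echo "$resp" | jq '.usage.completion_tokens')
echo "$tokens tokens / $(echo "$end - $start" | bc)s =" \
     "$(echo "scale=1; $tokens / ($end - $start)" | bc) tokens/s"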