Are there any updates to the recommended commands? #27
opened by NaiveYan
I tested the command in the current README with vLLM v0.8.0 (on 8 x A800 GPUs), but it only returns garbled text.
Are there any updates to the recommended commands, or are there other inference engines you would suggest?
Hardware: 8 × H800
Software: vllm==0.8.1
Running the command below with the V1 engine, I hit the following error whenever the input is longer than about 4k tokens:
File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 688, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] model_output = self.forward(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 626, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] def forward(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return fn(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] raise e
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "<eval_with_key>.124", line 2186, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] raise e
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "<eval_with_key>.2", line 5, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(x_7, x_11, k_pe, output_5, 'model.layers.0.self_attn.attn'); x_7 = x_11 = k_pe = output_5 = unified_attention_with_output = None
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/attention/layer.py", line 363, in unified_attention_with_output
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] self.impl.forward(self,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 929, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] output[num_decode_tokens:] = self._forward_prefill(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 826, in _forward_prefill
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] context_output, context_lse = self._compute_prefill_context( \
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 742, in _compute_prefill_context
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] kv_nope = self.kv_b_proj(kv_c_normed)[0].view( \
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 303, in apply
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return apply_awq_marlin_linear(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 379, in apply_awq_marlin_linear
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] output = ops.gptq_marlin_gemm(reshaped_x,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/_custom_ops.py", line 741, in gptq_marlin_gemm
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] RuntimeError: A is not contiguous
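```

For anyone unfamiliar with the error: the Marlin GEMM rejects an activation tensor that is not laid out contiguously in memory. A minimal PyTorch illustration (not vLLM code) of what "not contiguous" means, and the usual `.contiguous()` escape hatch:

```python
import torch

# A sliced tensor is a view over the original storage; its strides no longer
# match a dense row-major layout, so kernels that assume one reject it.
x = torch.randn(8, 16)
a = x[:, :8]                # view with strides (16, 1), not contiguous
print(a.is_contiguous())    # False
b = a.contiguous()          # copies into a dense row-major buffer
print(b.is_contiguous())    # True
```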
Command used:

```shell
export VLLM_USE_TRITON_FLASH_ATTN=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_FLASHINFER_FORCE_TENSOR_CORES=1
export VLLM_USE_V1=1
export VLLM_ENABLE_V1_MULTIPROCESSING=1
export VLLM_ATTENTION_BACKEND=FLASHMLA
export TORCH_CUDA_ARCH_LIST=9.0
vllm serve /DeepSeek-R1-awq --host 0.0.0.0 --port 8080 --trust-remote-code \
    --max-model-len 65536 --max-num-batched-tokens 65536 --max-seq-len-to-capture 65536 \
    --gpu-memory-utilization 0.95 --max-num-seqs 64 --served-model-name DeepSeek-R1 \
    --tensor-parallel-size 8 --enable-reasoning --reasoning-parser deepseek_r1 -q awq_marlin
```
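To reproduce, any request whose prompt exceeds roughly 4k tokens triggers the crash. A minimal client sketch against the server started above (the port and served model name come from the serve command; the prompt is simply padded past the threshold):

```python
import requests

# Send one long-prompt chat completion to the OpenAI-compatible endpoint
# vLLM exposes; ~2000 repetitions pushes the prompt past ~4k tokens.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "long text " * 2000}],
        "max_tokens": 64,
    },
)
print(resp.json())
```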
v2ray changed discussion status to closed
You need to merge all 3 PRs; one of them switches to a Marlin kernel that supports non-contiguous input.
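Until those PRs are merged, a possible stopgap (hypothetical and untested, inferred only from the traceback above) is to make the activation contiguous before the quantized projection in `vllm/v1/attention/backends/mla/common.py`:

```python
# _compute_prefill_context, around line 742 in vllm 0.8.1.
# Before (rejected by the Marlin GEMM when kv_c_normed is a strided view):
#   kv_nope = self.kv_b_proj(kv_c_normed)[0].view( \
# After: copy into a dense buffer first; the .view() arguments are
# unchanged from the original line and elided here.
kv_nope = self.kv_b_proj(kv_c_normed.contiguous())[0].view( \
```

This trades an extra copy on the prefill path for correctness; the proper fix is the kernel change in the PRs mentioned above.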