bug

#2
by su400 - opened

Very good work, and very much in line with the direction everyone was hoping for. Within the common 768GB VRAM budget, keeping more of the math- and programming-related layers in INT8 while maintaining a 64K context is exactly the right trade-off. Before that, though, one issue still needs to be resolved: after replacing the relevant files, actual inference still easily produces errors such as garbled text and gibberish. In particular, whenever two or more threads run concurrently, token encoding/decoding errors always appear.

QuantTrio org

Hello, thank you very much for your attention. Could you share some bad cases for reference? We haven't run a complete benchmark yet.

QuantTrio org
edited Jun 3

Hi @su400 ,

vLLM has just released v0.9.0.1, which specifically addresses issue #19007. Could you check whether your problem matches that bug? If so, upgrading to v0.9.0.1 and rerunning your test may resolve it.

If the upgrade doesn’t fully solve things—and the root cause turns out to be pure quantization quality rather than a vLLM bug—we can explore a higher-fidelity variant of our model. For reference, NVIDIA’s DeepSeek-R1-FP4 leaves every self_attn layer unquantized. If that approach proves effective, we could adopt a similar strategy: start from the current Compact build and re-quantize all of the self_attn layers in Int8 to create a “Higher-Fidelity” version.

Let me know how the new vLLM version works for you, and we can decide next steps from there.
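As a quick sanity check after upgrading, you can confirm which vLLM build the environment actually loads (a minimal snippet; nothing beyond a standard install is assumed):

```python
# Print the vLLM version that the current environment resolves to.
import vllm

print(vllm.__version__)  # expected: 0.9.0.1 after the upgrade
```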

Assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-02 16:02:36 [async_llm.py:261] Added request chatcmpl-f90616ac493d4c40b6291767a5696e9c.
ERROR 06-02 16:02:40 [async_llm.py:408] AsyncLLM output_handler failed.
ERROR 06-02 16:02:40 [async_llm.py:408] Traceback (most recent call last):
ERROR 06-02 16:02:40 [async_llm.py:408] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 384, in output_handler
ERROR 06-02 16:02:40 [async_llm.py:408] processed_outputs = output_processor.process_outputs(
ERROR 06-02 16:02:40 [async_llm.py:408] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [async_llm.py:408] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 350, in process_outputs
ERROR 06-02 16:02:40 [async_llm.py:408] stop_string = req_state.detokenizer.update(
ERROR 06-02 16:02:40 [async_llm.py:408] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [async_llm.py:408] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 106, in update
ERROR 06-02 16:02:40 [async_llm.py:408] self.output_text += self.decode_next(new_token_id)
ERROR 06-02 16:02:40 [async_llm.py:408] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [async_llm.py:408] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 201, in decode_next
ERROR 06-02 16:02:40 [async_llm.py:408] token = self.stream.step(self.tokenizer, next_token_id)
ERROR 06-02 16:02:40 [async_llm.py:408] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [async_llm.py:408] OverflowError: out of range integral type conversion attempted
INFO 06-02 16:02:40 [async_llm.py:420] Aborted request chatcmpl-fa49847b4ca04dcc82adbcdf6203afa1.
INFO 06-02 16:02:40 [async_llm.py:346] Request chatcmpl-fa49847b4ca04dcc82adbcdf6203afa1 failed.
ERROR 06-02 16:02:40 [serving_chat.py:884] Error in chat completion stream generator.
ERROR 06-02 16:02:40 [serving_chat.py:884] Traceback (most recent call last):
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 315, in generate
ERROR 06-02 16:02:40 [serving_chat.py:884] out = q.get_nowait() or await q.get()
ERROR 06-02 16:02:40 [serving_chat.py:884] ^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 51, in get
ERROR 06-02 16:02:40 [serving_chat.py:884] raise output
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 384, in output_handler
ERROR 06-02 16:02:40 [serving_chat.py:884] processed_outputs = output_processor.process_outputs(
ERROR 06-02 16:02:40 [serving_chat.py:884] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 350, in process_outputs
ERROR 06-02 16:02:40 [serving_chat.py:884] stop_string = req_state.detokenizer.update(
ERROR 06-02 16:02:40 [serving_chat.py:884] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 106, in update
ERROR 06-02 16:02:40 [serving_chat.py:884] self.output_text += self.decode_next(new_token_id)
ERROR 06-02 16:02:40 [serving_chat.py:884] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 201, in decode_next
ERROR 06-02 16:02:40 [serving_chat.py:884] token = self.stream.step(self.tokenizer, next_token_id)
ERROR 06-02 16:02:40 [serving_chat.py:884] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-02 16:02:40 [serving_chat.py:884] OverflowError: out of range integral type conversion attempted
ERROR 06-02 16:02:40 [serving_chat.py:884]
ERROR 06-02 16:02:40 [serving_chat.py:884] The above exception was the direct cause of the following exception:
ERROR 06-02 16:02:40 [serving_chat.py:884]
ERROR 06-02 16:02:40 [serving_chat.py:884] Traceback (most recent call last):
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 476, in chat_completion_stream_generator
ERROR 06-02 16:02:40 [serving_chat.py:884] async for res in result_generator:
ERROR 06-02 16:02:40 [serving_chat.py:884] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 347, in generate
ERROR 06-02 16:02:40 [serving_chat.py:884] raise EngineGenerateError() from e
ERROR 06-02 16:02:40 [serving_chat.py:884] vllm.v1.engine.exceptions.EngineGenerateError
I'm using vLLM v0.9.0.1 with the V1 engine and flashinfer-python 0.2.5. The log above shows the errors encountered during concurrent multi-threaded execution. In single-threaded mode the output is occasionally abnormal, but the server keeps running. I'm doing pipeline-parallel inference across two servers, with 768GB of GPU memory in total.
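For context on the traceback: "OverflowError: out of range integral type conversion attempted" is what the Rust-backed tokenizer raises when it is handed a token id it cannot convert to its internal unsigned integer type, for example a negative or otherwise out-of-range id produced during sampling. A minimal sketch of the same failure mode, assuming any Hugging Face fast tokenizer is available locally (the model name here is purely illustrative):

```python
from transformers import AutoTokenizer

# Any fast (Rust-backed) tokenizer behaves the same way; the name is illustrative.
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

print(tok.decode(tok.encode("hello")))  # a valid id sequence decodes normally

try:
    # An id outside the valid range (e.g. negative) cannot be converted to the
    # tokenizer's internal integer type, surfacing as the same OverflowError
    # seen in the detokenizer traceback above.
    tok.decode([-1])
except Exception as e:  # typically OverflowError from the Rust binding
    print(type(e).__name__, ":", e)
```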

QuantTrio org

Let's try V0, as suggested in the README.md. V1 is still new and may have glitches here and there.

Before you launch vLLM, run the following command:

export VLLM_USE_V1=0

I used export VLLM_USE_V1=0. There were no issues in single-threaded mode, but this error occurred when running three to four threads simultaneously.
Verify Model Quantization Configuration

The DeepSeek V2 model employs GPTQ + Marlin 8-bit quantization, where the Marlin kernel requires moe_block_size to be exactly 64. Your model might have been incorrectly configured with moe_block_size=128 during quantization.

Solution:

If using the vLLM framework's --moe-block-size argument, set it to 64:

--moe-block-size 64

If using a model configuration file (e.g., config.json), check for the moe_block_size parameter and update its value to 64.

(RayWorkerWrapper pid=5471, ip=192.168.0.177) ERROR 06-03 12:21:01 [worker_base.py:620] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 135, in fused_marlin_moe [repeated 7x across cluster]
(RayWorkerWrapper pid=5471, ip=192.168.0.177) ERROR 06-03 12:21:01 [worker_base.py:620] intermediate_cache1 = ops.moe_wna16_marlin_gemm( [repeated 7x across cluster]
(RayWorkerWrapper pid=5471, ip=192.168.0.177) ERROR 06-03 12:21:01 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 7x across cluster]
(RayWorkerWrapper pid=5471, ip=192.168.0.177) ERROR 06-03 12:21:01 [worker_base.py:620] File "/home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1489, in moe_wna16_marlin_gemm [repeated 7x across cluster]
(RayWorkerWrapper pid=5471, ip=192.168.0.177) ERROR 06-03 12:21:01 [worker_base.py:620] return torch.ops._moe_C.moe_wna16_marlin_gemm( [repeated 7x across cluster]
(RayWorkerWrapper pid=5471, ip=192.168.0.177) ERROR 06-03 12:21:01 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 7x across cluster]
(RayWorkerWrapper pid=5471, ip=192.168.0.177) ERROR 06-03 12:21:01 [worker_base.py:620] RuntimeError: unsupported moe_block_size=128 [repeated 7x across cluster]
(RayWorkerWrapper pid=6965) [rank6]:[W603 12:15:15.239607750 ProcessGroupNCCL.cpp:3629] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [repeated 14x across cluster]
(RayWorkerWrapper pid=6965) /home/kkk/miniconda3/envs/vllm1/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:488: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1577.) [repeated 6x across cluster]
(RayWorkerWrapper pid=6965) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8) [repeated 6x across cluster]
[rank0]:[W603 12:21:02.683405412 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
/home/kkk/miniconda3/envs/vllm1/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

QuantTrio org
edited Jun 3

This model is quantized with block size 128; it shouldn't be run with block size 64.
We will look into the issue.
It would be helpful if you could share your vLLM launch command with us, so we can reproduce it.
Thanks~
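For reference, the block/group size a GPTQ checkpoint was quantized with is usually recorded in its config.json under quantization_config. A minimal sketch for inspecting it (the local path is illustrative, and field names can differ between quantization toolchains):

```python
import json
from pathlib import Path

# Illustrative path: point this at the downloaded checkpoint directory.
model_dir = Path("/home/kkk/ai/models/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact")

cfg = json.loads((model_dir / "config.json").read_text())
qcfg = cfg.get("quantization_config", {})

# GPTQ checkpoints typically record the method, bit width and group size here;
# a group_size of 128 would be consistent with the block size mentioned above.
for key in ("quant_method", "bits", "group_size", "sym"):
    print(key, "=", qcfg.get(key))
```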

conda activate vllm1
export NCCL_SOCKET_IFNAME=ens31f0np0
export GLOO_SOCKET_IFNAME=ens31f0np0
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO
export VLLM_DISTRIBUTED_INIT_METHOD="env://"
export RAY_CGRAPH_get_timeout=160
export CUDA_DEVICE_ORDER=PCI_BUS_ID

ray start --head \
  --node-ip-address=192.168.0.177 \
  --port=6379 \
  --num-gpus=8 \
  --min-worker-port=10002 \
  --max-worker-port=10100 \
  --dashboard-host=0.0.0.0
export VLLM_USE_V1=0

conda activate vllm1
export NCCL_SOCKET_IFNAME=eno1np0
export GLOO_SOCKET_IFNAME=eno1np0
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO
export VLLM_DISTRIBUTED_INIT_METHOD="env://"
export RAY_CGRAPH_get_timeout=160
export CUDA_DEVICE_ORDER=PCI_BUS_ID

ray start --address=192.168.0.177:6379 \
  --num-gpus=8 \
  --min-worker-port=10002 \
  --max-worker-port=10100
export VLLM_USE_V1=0

RAY_IGNORE_UNHANDLED_ERRORS=1 python -m vllm.entrypoints.openai.api_server \
  --model /home/kkk/ai/models/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --enable-chunked-prefill \
  --host 0.0.0.0 \
  --port 9997 \
  --enable-prefix-caching \
  --served-model-name DeepSeek-R1 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --max-num-batched-tokens 65535 \
  --max-model-len 65535 \
  --swap-space 24 \
  --max-num-seqs 16 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --enable-chunked-prefill
