Quickly degrades
First off - thank you for the awesome AWQ quants!
I have got it running on my 8x 3090 node with this command:
vllm serve QuantTrio/MiniMax-M2-AWQ \
--served-model-name MiniMax-M2 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--swap-space 16 \
--max-num-seqs 3 \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code
But I get odd output: it starts out fine, then quickly degrades as it generates.
Do you have advice for this situation?
Thanks!
Eric Hartford
Sounds about right to me... That's how my thoughts sound when my ADHD meds wear off towards the end of the day 🙃 lol!
Same issue here using recommended temp/top_k/top_p - totally incoherent babbling
Here is the output on my 4090 rig (using default params)
<think>The user requests a random fun fact about the Roman Empire. This is a benign request. There's no policy violation. We'll just provide a fun fact about the Roman Empire. Ensure we answer succinctly. Provide interesting fact, perhaps about the Roman baths, the Roman military engineering, or some other interesting fact.
Given the request "Tell me a random fun fact about the Roman Empire", we can comply. The answer is short and factual. We'll produce a fun fact. We'll be careful not to provide disallowed content. There's no mention of personal data. It's straightforward.
Thus the response: "Roman Emperor Nero fiddled while Rome burned? That's a myth. A more fun fact: The Romans used a form of credit called 'pecunia' and 'nummus' and had complex loan contracts." Or something about the existence of a massive network of roads. Or about the invention of concrete and the Pantheon dome. Or about the existence of a Roman city with a population that had 10,000 people living under one roof (the insulae). Or about the use of lead pipes for water distribution. Or about Roman emperors being declared divine. Or about their obsession with baths. Or about the fact that the Roman army used a "corps of engineers" that built roads, aqueducts, and catapults.
Pick one. Let's say: "Roman soldiers used a type of lead pipe to supply water to cities, and they made lead pipes that contributed to lead poisoning". That could be interesting but might be a bit dark. Better something more fun: "The Romans had a massive system of aqueducts that could transport water for miles, and at its height the network spanned over 400 miles in Rome alone." That might be a fun fact. Or "The Romans built a structure called the Pantheon, which has a dome with a central opening (oculus) that is still the world's largest unreinforced dome." That's also interesting.
Alternatively: "The Roman Empire had a system of 'tabeliones' who kept records, making early forms of paperwork." That's more obscure.
Better: "The Roman Empire's army used a type of military engineering: they built roads and bridges that still exist in many forms." That is not really fun. "Roman Emperor Hadrian built a massive wall across northern England (Hadrian's Wall) that was 73 miles long and took about 9 years to complete." That's interesting but more historical.
Fun fact: "The Romans invented a type of concrete that used volcanic ash and set underwater, which is why the Pantheon still stands today." That's also a good fact.
The user likely expects a random fun fact that is short.
We'll answer with a fun fact: "Roman soldiers often carried a 'pudenda' - a sack containing personal items." That is more obscure.
Better to choose something like "Roman baths often had piped hot and cold water with sophisticated heating systems, the 'hypocaust', to keep the water hot." That is known but could be considered fun.
Alternatively: "Roman soldiers used a form of a credit system, called 'pecunia', and had the concept of 'usury' regulated by law, with interest rates up to 12% per year."
Given the user probably wants something that stands out: The Roman Empire once had a 'Great Fire of Rome' that led to the construction of a new city plan with a grid system. Also that they used a secret password system "signum". Possibly the best: "The Romans built a 100-foot-long aqueduct that supplied the city of Rome with water and had to have a 'siphoning' technique, using inverted siphons made of stone pipes to transport water across valleys."
Better is: "The Roman Empire had a system of 'collegia' - professional associations - which organized everything from burial societies to entertainers. The Romans even had a 'professional gladiator' guild with its own union." That is interesting but maybe too obscure.
The best: "The Roman Empire had a unique, long-lasting form of engineering: the Pantheon's dome with a central oculus - a hole at the top that allowed light to stream down, and it remains the largest unreinforced dome ever built." That is a fun fact.
Another: "Roman soldiers were required to have a 'cervical collar' known as 'pugio' to protect their necks." That's not correct.
Better: "Roman army's supply lines were called 'viae militares' and were so well developed that they still affect Europe's road network." That is true.
Thus, answer: "The Roman Empire had a network of roads totaling roughly 250,000 miles (400,000 km) and some of those roads are still used today, being the longest continuously used roads." That's fun.
Alternatively, "Rome was the first empire to have an official currency, the 'denarius', which remained in circulation for centuries." Not exactly.
Let's choose something with a fun twist: "Romans used 'spongia' - sponges made from sea sponges - to wipe their faces." Not interesting.
Better: "The Roman army used a 'cavalry' with a horse harness called 'tribulum' that allowed them to turn quickly." Not sure.
Better: "Romans invented 'concrete' using volcanic ash, which allowed the construction of large structures like the Colosseum."
Yes. That is a fun fact: "The Pantheon, built around 126 AD, features a dome with a central opening (oculus) that is 43 feet in diameter, making it the world's largest unreinforced concrete dome, still standing after 2,000 years."
Given the simplicity, I can respond with one or two fun facts. Provide a short but interesting fact.
I think I'll answer: "Roman Empire had a system of aqueducts that used the 'inverted siphon' technique to carry water across valleys, making possible massive water supply that could deliver more than 2 million gallons per day to Rome's fountains."
But the user likely expects one fun fact. Provide it. Let's be succinct and interesting.
Thus answer:
"One amusing fact: The Romans built the world’s first known "public libraries", but they also had a system of private 'libraries' where wealthy Romans displayed scrolls as a status symbol – even the walls of some Roman apartments were lined with shelves for hundreds of scrolls."
Probably that is okay.
Alternatively: "Roman Emperors were sometimes declared 'God' after death; Caligula's statue was worshipped."
I think the Pantheon fact is best.
Thus final answer: a fun fact about the Roman Empire. Let's answer with: "The Pantheon's dome remains the largest unreinforced concrete dome and the only one with a central opening (the oculus)."
Yes.
Now produce answer.
</think>
**Fun Roman‑Empire fact:**
The Roman Empire built the **Pantheon** in Rome (c. 126 AD). Its dome is the largest unreinforced concrete dome ever constructed, measuring about 43 feet (13 m) across. At the very top of the dome there’s an oculus – a circular opening that lets daylight stream in and rain fall through, creating an ever‑changing “celestial” glow. Even after more than 2,000 years, the Pantheon still stands without any steel or other reinforcements, a testament to Roman concrete’s remarkable durability.
another one
<think>I need to tell a fun fact about the Roman Empire! Here’s one: the empire covered around 1.7 million square miles at its peak, about half the size of the USA, and had around 50-70 million inhabitants. That's fascinating! Another fun fact is that "spaghetti" isn't from ancient Rome; instead, Romans enjoyed pasta, but it was more like "lagane" or "macaria," made from durum wheat, often baked with cheese and honey. There are many more interesting tidbits to explore!
</think>
### Fun fact: The Romans built a 19‑meter-wide, 600‑km-long superhighway called the Via Appia
Known as the “Queen of Roads,” the Via Appia connected Rome to Brindisi, allowing troop movements and trade across the peninsula. It was such an engineering marvel that parts are still visible today.
Dual RTX 6000 Pro Blackwell setup here, latest nightly vLLM container, recommended settings. First prompt is OK, then it goes bonkers and never stops outputting:
Tried changing "model_type": "mixtral" to "model_type": "minimax" in config.json, and tried copying generation_config.json from the original repo, but no real improvement.
docker run -it --rm \
--name ${CONTAINER_NAME} \
--runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /home/shane/models:/models \
--add-host="host.docker.internal:host-gateway" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:nightly \
--model /models/MiniMax-M2-AWQ \
--served-model-name "MiniMax-M2" \
--host 0.0.0.0 \
--port 8000 \
--swap-space 16 \
--max-num-seqs 32 \
--enable-expert-parallel \
--tensor-parallel-size 2 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--gpu-memory-utilization 0.9
I just tried like 10 times in a row on my 4090 machine, and all of the results were coherent.
At this point, I'm not sure what the cause is 🥹.
What versions of PyTorch, vLLM, and CUDA are you on?
Can you share your vllm arguments?
Ubuntu 22, Python 3.12, CUDA 12.8
vLLM installed using
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
served using
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=4
vllm serve \
$MODEL_PATH \
--served-model-name $MODEL_NAME \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--swap-space 4 \
--max-num-seqs 8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--distributed-executor-backend mp \
--trust-remote-code \
--host 0.0.0.0 \
--port 12345
pip list
Package Version
---------------------------------- --------------------------------
aiohappyeyeballs 2.6.1
aiohttp 3.13.2
aiosignal 1.4.0
annotated-doc 0.0.3
annotated-types 0.7.0
anthropic 0.71.0
anyio 4.11.0
apache-tvm-ffi 0.1.0b15
astor 0.8.1
attrs 25.4.0
blake3 1.0.8
cachetools 6.2.1
cbor2 5.7.1
certifi 2025.10.5
cffi 2.0.0
charset-normalizer 3.4.4
click 8.2.1
cloudpickle 3.1.1
compressed-tensors 0.12.2
cuda-bindings 13.0.3
cuda-pathfinder 1.3.2
cuda-python 13.0.3
cupy-cuda12x 13.6.0
depyf 0.20.0
dill 0.4.0
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
docstring_parser 0.17.0
einops 0.8.1
email-validator 2.3.0
fastapi 0.120.2
fastapi-cli 0.0.14
fastapi-cloud-cli 0.3.1
fastrlock 0.8.3
filelock 3.20.0
flashinfer-python 0.4.1
frozenlist 1.8.0
fsspec 2025.9.0
gguf 0.17.1
h11 0.16.0
hf-xet 1.2.0
httpcore 1.0.9
httptools 0.7.1
httpx 0.28.1
huggingface-hub 0.36.0
idna 3.11
importlib_metadata 8.7.0
interegular 0.3.3
Jinja2 3.1.6
jiter 0.11.1
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
lark 1.2.2
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.11.3
loguru 0.7.3
markdown-it-py 4.0.0
MarkupSafe 3.0.3
mdurl 0.1.2
mistral_common 1.8.5
mpmath 1.3.0
msgpack 1.1.2
msgspec 0.19.0
multidict 6.7.0
networkx 3.5
ninja 1.13.0
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.15.0
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-cutlass-dsl 4.3.0.dev0
nvidia-ml-py 13.580.82
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.3.20
nvidia-nvtx-cu12 12.8.90
openai 2.6.1
openai-harmony 0.0.4
opencv-python-headless 4.12.0.88
opentelemetry-api 1.38.0
opentelemetry-sdk 1.38.0
opentelemetry-semantic-conventions 0.59b0
outlines_core 0.2.11
packaging 25.0
partial-json-parser 0.2.1.1.post6
pillow 12.0.0
pip 25.0.1
prometheus_client 0.23.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.4.1
protobuf 6.33.0
psutil 7.1.2
py-cpuinfo 9.0.0
pybase64 1.4.2
pycountry 24.6.1
pycparser 2.23
pydantic 2.12.3
pydantic_core 2.41.4
pydantic-extra-types 2.10.6
Pygments 2.19.2
python-dotenv 1.2.1
python-json-logger 4.0.0
python-multipart 0.0.20
PyYAML 6.0.3
pyzmq 27.1.0
ray 2.51.0
referencing 0.37.0
regex 2025.10.23
requests 2.32.5
rich 14.2.0
rich-toolkit 0.15.1
rignore 0.7.2
rpds-py 0.28.0
safetensors 0.6.2
scipy 1.16.3
sentencepiece 0.2.1
sentry-sdk 3.0.0a7
setproctitle 1.3.7
setuptools 79.0.1
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
soxr 1.0.0
starlette 0.49.1
sympy 1.14.0
tabulate 0.9.0
tiktoken 0.12.0
tokenizers 0.22.1
torch 2.9.0
torchaudio 2.9.0
torchvision 0.24.0
tqdm 4.67.1
transformers 4.57.1
triton 3.5.0
typer 0.20.0
typing_extensions 4.15.0
typing-inspection 0.4.2
urllib3 2.5.0
uvicorn 0.38.0
uvloop 0.22.1
vllm 0.11.1rc5.dev34+g48eb8eba5.cu129
watchfiles 1.1.1
websockets 15.0.1
xgrammar 0.1.25
yarl 1.22.0
zipp 3.23.0
Tried these exact settings (but modified for 2 GPUs) - first prompt works, then just totally crazy. Tried the latest vLLM nightly pushed 7 hrs ago as well. Must be some issue with Blackwell.
Have you tried the uv method as written in the official vLLM MiniMax-M2 guide?
uv venv
source .venv/bin/activate
uv pip install 'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \
vllm --extra-index-url https://wheels.vllm.ai/nightly --prerelease=allow
Yes, unfortunately the install seems broken at the moment and can't resolve dependencies.
The LLM starts generating output but quickly goes off-track, repeating words and phrases endlessly without a proper stop token.
Using the recommended top_p, top_k, and temperature:
"temperature": 1.0,
"top_p": 0.95,
"stream": false,
"top_k": 40
Could it be due to file #27 being missing as reported in the other post?
darn, yes...
Sorry, that was my mistake. I’ve now uploaded model-00027-of-00041.safetensors.
Thanks - looks like we're actually missing 18, 19, 21, 23, 25, 30, 35 as well. Can't wait to run it
I’m in the process of uploading the missing files. This is definitely my mistake, and I sincerely apologize for the inconvenience. I’ll double-check everything to make sure it’s all in order!
All good, love these quants!
This was my oversight. I've now uploaded the missing files and fully validated the complete set on a 2xA100 setup.
Upload model-00018-of-00041.safetensors
Upload model-00019-of-00041.safetensors
Upload model-00021-of-00041.safetensors
Upload model-00023-of-00041.safetensors
Upload model-00025-of-00041.safetensors
Upload model-00027-of-00041.safetensors
Upload model-00030-of-00041.safetensors
Upload model-00035-of-00041.safetensors
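(If you grabbed the repo earlier, here's a quick way to sanity-check that no shards are missing before re-downloading - a minimal sketch using huggingface_hub, assuming the standard model-XXXXX-of-00041.safetensors naming:)

from huggingface_hub import HfApi

# Sketch: list the repo's files and report any of the 41 shards that are absent.
files = set(HfApi().list_repo_files("QuantTrio/MiniMax-M2-AWQ"))
missing = [f"model-{i:05d}-of-00041.safetensors" for i in range(1, 42)
           if f"model-{i:05d}-of-00041.safetensors" not in files]
print("missing shards:", missing or "none")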
amazing, I will give it a try!
Looks solid! It's consistently telling me fun Roman facts I wish I didn't know now.
👌
So far this model seems to work flawlessly with Kilo Code and Claude Code. Really amazing, thank you so much for this!
Would you know why it gets stuck at the end with this message repeating?
admin@jarvis:~/dev-ai/llama.cpp$ docker run --name vllm-jarvis-thinker \
-it --rm \
--gpus '"device=0,1"' \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e SAFETENSORS_FAST_GPU=1 \
-e VLLM_MOE_USE_DEEP_GEMM=0 \
vllm/vllm-openai:latest \
--model /models/QuantTrio/MiniMax-M2-AWQ \
--served-model-name jarvis-thinker \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--max-model-len 131072 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--distributed-executor-backend mp \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
WARNING 12-03 08:26:20 [argparse_utils.py:195] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 12-03 08:26:20 [api_server.py:1772] vLLM API server version 0.12.0
(APIServer pid=1) INFO 12-03 08:26:20 [utils.py:253] non-default args: {'model_tag': '/models/QuantTrio/MiniMax-M2-AWQ', 'host': '0.0.0.0', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m2', 'model': '/models/QuantTrio/MiniMax-M2-AWQ', 'trust_remote_code': True, 'max_model_len': 131072, 'served_model_name': ['jarvis-thinker'], 'reasoning_parser': 'minimax_m2_append_think', 'distributed_executor_backend': 'mp', 'tensor_parallel_size': 2, 'enable_expert_parallel': True}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 12-03 08:26:25 [model.py:637] Resolved architecture: MiniMaxM2ForCausalLM
(APIServer pid=1) INFO 12-03 08:26:25 [model.py:1750] Using max model len 131072
(APIServer pid=1) INFO 12-03 08:26:25 [awq_marlin.py:163] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
(APIServer pid=1) INFO 12-03 08:26:25 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=270) INFO 12-03 08:26:31 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='/models/QuantTrio/MiniMax-M2-AWQ', speculative_config=None, tokenizer='/models/QuantTrio/MiniMax-M2-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='minimax_m2_append_think', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=jarvis-thinker, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=270) WARNING 12-03 08:26:31 [multiproc_executor.py:880] Reducing Torch parallelism from 56 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 12-03 08:26:36 [parallel_state.py:1200] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:34151 backend=nccl
INFO 12-03 08:26:36 [parallel_state.py:1200] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:34151 backend=nccl
INFO 12-03 08:26:36 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 12-03 08:26:36 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
WARNING 12-03 08:26:36 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
INFO 12-03 08:26:36 [parallel_state.py:1408] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
INFO 12-03 08:26:36 [parallel_state.py:1408] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(Worker_TP0_EP0 pid=404) INFO 12-03 08:26:37 [gpu_model_runner.py:3467] Starting to load model /models/QuantTrio/MiniMax-M2-AWQ...
(Worker_TP0_EP0 pid=404) [2025-12-03 08:26:37] INFO _optional_torch_c_dlpack.py:119: JIT-compiling torch-c-dlpack-ext to cache...
(Worker_TP1_EP1 pid=405) [2025-12-03 08:26:37] INFO _optional_torch_c_dlpack.py:119: JIT-compiling torch-c-dlpack-ext to cache...
(Worker_TP1_EP1 pid=405) INFO 12-03 08:26:51 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(Worker_TP1_EP1 pid=405) INFO 12-03 08:26:51 [layer.py:379] Enabled separate cuda stream for MoE shared_experts
(Worker_TP0_EP0 pid=404) INFO 12-03 08:26:51 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(Worker_TP0_EP0 pid=404) INFO 12-03 08:26:51 [layer.py:379] Enabled separate cuda stream for MoE shared_experts
(Worker_TP1_EP1 pid=405) INFO 12-03 08:26:51 [layer.py:499] [EP Rank 1/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 128/256. Experts local to global index map: 0->128, 1->129, 2->130, 3->131, 4->132, 5->133, 6->134, 7->135, 8->136, 9->137, 10->138, 11->139, 12->140, 13->141, 14->142, 15->143, 16->144, 17->145, 18->146, 19->147, 20->148, 21->149, 22->150, 23->151, 24->152, 25->153, 26->154, 27->155, 28->156, 29->157, 30->158, 31->159, 32->160, 33->161, 34->162, 35->163, 36->164, 37->165, 38->166, 39->167, 40->168, 41->169, 42->170, 43->171, 44->172, 45->173, 46->174, 47->175, 48->176, 49->177, 50->178, 51->179, 52->180, 53->181, 54->182, 55->183, 56->184, 57->185, 58->186, 59->187, 60->188, 61->189, 62->190, 63->191, 64->192, 65->193, 66->194, 67->195, 68->196, 69->197, 70->198, 71->199, 72->200, 73->201, 74->202, 75->203, 76->204, 77->205, 78->206, 79->207, 80->208, 81->209, 82->210, 83->211, 84->212, 85->213, 86->214, 87->215, 88->216, 89->217, 90->218, 91->219, 92->220, 93->221, 94->222, 95->223, 96->224, 97->225, 98->226, 99->227, 100->228, 101->229, 102->230, 103->231, 104->232, 105->233, 106->234, 107->235, 108->236, 109->237, 110->238, 111->239, 112->240, 113->241, 114->242, 115->243, 116->244, 117->245, 118->246, 119->247, 120->248, 121->249, 122->250, 123->251, 124->252, 125->253, 126->254, 127->255.
(Worker_TP0_EP0 pid=404) INFO 12-03 08:26:51 [layer.py:499] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 128/256. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31, 32->32, 33->33, 34->34, 35->35, 36->36, 37->37, 38->38, 39->39, 40->40, 41->41, 42->42, 43->43, 44->44, 45->45, 46->46, 47->47, 48->48, 49->49, 50->50, 51->51, 52->52, 53->53, 54->54, 55->55, 56->56, 57->57, 58->58, 59->59, 60->60, 61->61, 62->62, 63->63, 64->64, 65->65, 66->66, 67->67, 68->68, 69->69, 70->70, 71->71, 72->72, 73->73, 74->74, 75->75, 76->76, 77->77, 78->78, 79->79, 80->80, 81->81, 82->82, 83->83, 84->84, 85->85, 86->86, 87->87, 88->88, 89->89, 90->90, 91->91, 92->92, 93->93, 94->94, 95->95, 96->96, 97->97, 98->98, 99->99, 100->100, 101->101, 102->102, 103->103, 104->104, 105->105, 106->106, 107->107, 108->108, 109->109, 110->110, 111->111, 112->112, 113->113, 114->114, 115->115, 116->116, 117->117, 118->118, 119->119, 120->120, 121->121, 122->122, 123->123, 124->124, 125->125, 126->126, 127->127.
Loading safetensors checkpoint shards: 0% Completed | 0/41 [00:00<?, ?it/s]
[... progress lines trimmed: all 41 shards load in about 19 seconds ...]
Loading safetensors checkpoint shards: 100% Completed | 41/41 [00:19<00:00, 2.15it/s]
(Worker_TP0_EP0 pid=404)
(Worker_TP0_EP0 pid=404) INFO 12-03 08:27:11 [default_loader.py:308] Loading weights took 19.07 seconds
(Worker_TP0_EP0 pid=404) INFO 12-03 08:27:17 [gpu_model_runner.py:3549] Model loading took 56.4011 GiB memory and 39.828850 seconds
(Worker_TP0_EP0 pid=404) INFO 12-03 08:27:26 [backends.py:655] Using cache directory: /root/.cache/vllm/torch_compile_cache/6ec3beba80/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0_EP0 pid=404) INFO 12-03 08:27:26 [backends.py:715] Dynamo bytecode transform time: 9.02 s
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:28.597000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:28.598000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:30.670000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:30.683000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) INFO 12-03 08:27:32 [backends.py:257] Cache the graph for dynamic shape for later use
(Worker_TP1_EP1 pid=405) /usr/local/lib/python3.12/dist-packages/torch/backends/cuda/__init__.py:131: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(Worker_TP1_EP1 pid=405) return torch._C._get_cublas_allow_tf32()
(Worker_TP1_EP1 pid=405) /usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
(Worker_TP1_EP1 pid=405) warnings.warn(
(Worker_TP0_EP0 pid=404) /usr/local/lib/python3.12/dist-packages/torch/backends/cuda/__init__.py:131: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(Worker_TP0_EP0 pid=404) return torch._C._get_cublas_allow_tf32()
(Worker_TP0_EP0 pid=404) /usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
(Worker_TP0_EP0 pid=404) warnings.warn(
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:32.601000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:32.642000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:32.791000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:32.833000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:33.849000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:34.187000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:34.387000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:34.431000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:36.386000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:36.447000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:36.579000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:36.637000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:37.393000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:37.458000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:37.584000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:37.649000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:38.392000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:38.464000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:38.590000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:38.659000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:39.395000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:39.465000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:39.586000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:39.660000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:40.394000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:40.471000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:40.582000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:40.666000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:41.393000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:41.488000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:41.582000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:41.680000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:42.399000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:42.508000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:42.592000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:42.700000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:43.509000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:43.697000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:43.883000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:44.084000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:44.897000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:45.033000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:45.096000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:45.230000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:45.898000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:46.031000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:46.095000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:46.229000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:46.918000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:47.048000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:47.117000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:47.248000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:47.915000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:48.045000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:48.113000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:48.248000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:48.920000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:49.051000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:49.117000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:49.252000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:49.919000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:50.063000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:50.116000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:50.263000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:50.925000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:51.069000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:51.122000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:51.262000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:51.925000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:52.075000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:52.125000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:52.274000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:53.495000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:53.638000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:53.693000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:53.830000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:54.524000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:54.666000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:54.717000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:54.861000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:55.533000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:55.675000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:55.722000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:55.870000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:56.524000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:56.670000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:56.711000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:56.865000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:57.523000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:57.680000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:57.714000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:57.872000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:58.540000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:58.693000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:58.726000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:58.883000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:59.537000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:59.701000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:27:59.725000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:27:59.890000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:00.533000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:00.702000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:00.719000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:00.891000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:01.530000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:01.713000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:01.723000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:01.903000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:03.140000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:03.331000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:03.335000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:03.527000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:04.152000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:04.343000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:04.350000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:04.551000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:05.198000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:05.389000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:05.434000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:05.630000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:06.202000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:06.393000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:06.443000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:06.654000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:07.219000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:07.410000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:07.478000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:07.677000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP1_EP1 pid=405) [rank1]:W1203 08:28:08.222000 405 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
[... the same ComboKernels warning repeats from both ranks roughly twice per second ...]
(EngineCore_DP0 pid=270) INFO 12-03 08:28:17 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[... the ComboKernels warnings keep repeating until compilation finishes ...]
(Worker_TP0_EP0 pid=404) INFO 12-03 08:28:39 [backends.py:288] Compiling a graph for dynamic shape takes 72.44 s
(EngineCore_DP0 pid=270) INFO 12-03 08:29:17 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=270) INFO 12-03 08:30:17 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=270) INFO 12-03 08:31:17 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[... the same shm_broadcast message repeats every 60 seconds ...]
It took me about 30 minutes to load with DeepGEMM enabled.
How about disabling it first, just to see if it works:
export VLLM_USE_DEEP_GEMM=0
vllm serve ...
That said, I don't have an sm120 device to test on. Blackwell is still quite new, so compatibility issues like this are a headache...
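If it still hangs after that, the repeating shm_broadcast message suggests the workers are stuck in the torch.compile phase. Here's a minimal sketch of a launch that skips graph compilation entirely — the tensor-parallel size is a guess from your log (it shows two ranks), and --enforce-eager is a standard vLLM flag that disables CUDA graphs and torch.compile, trading some throughput for a much simpler startup:
# Sketch, not a confirmed fix: disable DeepGEMM and skip graph compilation
export VLLM_USE_DEEP_GEMM=0
vllm serve QuantTrio/MiniMax-M2-AWQ \
--tensor-parallel-size 2 \
--enforce-eager \
--trust-remote-code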
Would you know why it gets stuck at the end, with this message repeating?
[...]
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:34.878000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) [rank0]:W1203 08:28:35.083000 404 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
(Worker_TP0_EP0 pid=404) INFO 12-03 08:28:39 [backends.py:288] Compiling a graph for dynamic shape takes 72.44 s
(EngineCore_DP0 pid=270) INFO 12-03 08:29:17 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=270) INFO 12-03 08:30:17 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=270) INFO 12-03 08:31:17 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[... the same shm_broadcast message repeats every 60 seconds ...]
The model works much better with SGLang on my dual-Blackwell setup:
Script
#!/bin/bash
CONTAINER_NAME="sglang-minimax-m2"
# Remove existing container
if docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
echo "Removing existing container: ${CONTAINER_NAME}"
docker rm -f "${CONTAINER_NAME}"
fi
echo "Starting SGLang Docker container for Minimax M2 AWQ..."
# -d: run detached so the log/API hints below print immediately
docker run -itd \
--name "${CONTAINER_NAME}" \
--runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /<path to models>/models:/models \
--add-host="host.docker.internal:host-gateway" \
-p 8000:8000 \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--show-time-cost \
--model-path /models/MiniMax-M2-AWQ \
--tokenizer-path /models/MiniMax-M2-AWQ \
--tensor-parallel-size 2 \
--context-length 196608 \
--host 0.0.0.0 \
--port 8000 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax
echo "Container started. Logs: docker logs -f ${CONTAINER_NAME}"
echo "API available at: http://localhost:8000/v1/chat/completions"

