Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
Abstract
HEX, a training-free inference method for diffusion-based large language models, ensembles diverse block-wise generation paths via majority voting to improve accuracy across reasoning benchmarks.
Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By majority voting over generation paths produced with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56× (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in dLLMs, revealing that the order in which tokens are unmasked during inference plays a critical role in determining performance.
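As a minimal sketch of the idea described above, the snippet below ensembles one dLLM over several semi-autoregressive block schedules and majority-votes the resulting answers. Everything here is illustrative rather than the paper's released implementation: `generate_fn` stands in for a block-wise dLLM decoder (e.g., a LLaDA-style model that decodes `gen_len` tokens left-to-right in blocks of `block_size`), `extract_answer` is a toy GSM8K-style numeric parser, and the particular block sizes are assumed.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional


def extract_answer(text: str) -> Optional[str]:
    """Illustrative parser: treat the last number in the completion as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def hex_majority_vote(
    prompt: str,
    generate_fn: Callable[[str, int, int], str],  # (prompt, block_size, gen_len) -> completion
    block_sizes: Iterable[int] = (4, 8, 16, 32, 64),  # assumed heterogeneous schedules
    gen_len: int = 256,
) -> Optional[str]:
    """Run one dLLM under several block schedules and majority-vote the parsed answers."""
    answers = []
    for block_size in block_sizes:
        # Each block size plays the role of one "hidden semi-autoregressive expert".
        completion = generate_fn(prompt, block_size, gen_len)
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # The answer most schedules agree on wins; ties resolve to the first one seen.
    return Counter(answers).most_common(1)[0][0]
```

Because the vote is taken across schedules rather than across repeated samples from one fixed schedule, no single block size can single-handedly collapse the final answer.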
Community
🚀 A new dimension of test-time scaling in Diffusion LLMs!
Diffusion-based LLMs are incredibly flexible; they can decode tokens in any order. But during training, they don’t just learn one reasoning strategy…
🧩 They implicitly learn a mixture of hidden semi-autoregressive experts, each favoring different token orders and reasoning paths.
In our latest work, we introduce HEX (Hidden Semi-Autoregressive Experts), an inference algorithm that exploits this hidden flexibility to improve performance (even outperforming GRPO-fine-tuned models).
🔗 Read more:
🧵 Detailed thread → https://lnkd.in/eZ-DZrTA
🌐 Project page → https://lnkd.in/ehbRvpUb
Papers similar to this one, recommended by the Semantic Scholar API:
- MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models (2025)
- Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models (2025)
- Fast-dLLM v2: Efficient Block-Diffusion LLM (2025)
- AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size (2025)
- RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance (2025)
- Diffusion Language Models Know the Answer Before Decoding (2025)
- Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding (2025)