Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
Abstract
HEX, a training-free inference method for diffusion-based large language models, ensembles diverse block-wise generation paths via majority voting to improve accuracy across reasoning benchmarks.
Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By majority voting over generation paths produced with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56× (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in dLLMs, revealing that the order in which tokens are unmasked during inference plays a critical role in determining performance.
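As a minimal sketch of the idea described above, the snippet below ensembles one dLLM over several semi-autoregressive block schedules and majority-votes the resulting answers. Everything here is illustrative rather than the paper's released implementation: `generate_fn` stands in for a block-wise dLLM decoder (e.g., a LLaDA-style model that decodes `gen_len` tokens left-to-right in blocks of `block_size`), `extract_answer` is a toy GSM8K-style numeric parser, and the particular block sizes are assumed.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional


def extract_answer(text: str) -> Optional[str]:
    """Illustrative parser: treat the last number in the completion as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def hex_majority_vote(
    prompt: str,
    generate_fn: Callable[[str, int, int], str],  # (prompt, block_size, gen_len) -> completion
    block_sizes: Iterable[int] = (4, 8, 16, 32, 64),  # assumed heterogeneous schedules
    gen_len: int = 256,
) -> Optional[str]:
    """Run one dLLM under several block schedules and majority-vote the parsed answers."""
    answers = []
    for block_size in block_sizes:
        # Each block size plays the role of one "hidden semi-autoregressive expert".
        completion = generate_fn(prompt, block_size, gen_len)
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # The answer most schedules agree on wins; ties resolve to the first one seen.
    return Counter(answers).most_common(1)[0][0]
```

Because the vote is taken across schedules rather than across repeated samples from one fixed schedule, no single block size can single-handedly collapse the final answer.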
Community
🚀 A new dimension of test-time scaling in Diffusion LLMs!
Diffusion-based LLMs are incredibly flexible; they can decode tokens in any order. But during training, they don’t just learn one reasoning strategy…
🧩 They implicitly learn a mixture of hidden semi-autoregressive experts, each favoring different token orders and reasoning paths.
In our latest work, we introduce HEX (Hidden Semi-Autoregressive Experts), an inference algorithm that exploits this hidden flexibility to improve performance (even outperforming GRPO-fine-tuned models).
🔗 Read more:
🧵 Detailed thread → https://lnkd.in/eZ-DZrTA
🌐 Project page → https://lnkd.in/ehbRvpUb
Papers similar to this one, recommended by the Semantic Scholar API:
- MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models (2025)
- Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models (2025)
- Fast-dLLM v2: Efficient Block-Diffusion LLM (2025)
- AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size (2025)
- RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance (2025)
- Diffusion Language Models Know the Answer Before Decoding (2025)
- Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding (2025)