SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Abstract
The Sandwiched Policy Gradient (SPG) method improves reinforcement learning for diffusion large language models by using both upper and lower bounds of the log-likelihood, outperforming ELBO-based methods.
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates such as the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on the ELBO or one-step estimation, improving accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
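To make the mechanism concrete, below is a minimal PyTorch sketch of a sandwiched surrogate loss. The estimator names (`elbo`, `eubo`) and the rule of selecting the bound by the sign of the advantage are illustrative assumptions inferred from the abstract, not the authors' exact implementation.

```python
import torch

def spg_surrogate_loss(elbo: torch.Tensor, eubo: torch.Tensor,
                       advantages: torch.Tensor) -> torch.Tensor:
    """Sandwiched policy-gradient surrogate (illustrative sketch).

    elbo, eubo: (B,) Monte Carlo estimates of a lower and an upper bound
                on log p(y | x) under the current policy.
    advantages: (B,) per-sequence advantages (e.g., group-normalized rewards).

    Intuition: for a positive advantage we push UP a lower bound (a safe
    proxy for increasing the true log-likelihood); for a negative advantage
    we push DOWN an upper bound (a safe proxy for decreasing it). The
    reward-weighted log-likelihood objective is thus sandwiched between
    the two one-sided surrogates.
    """
    bound = torch.where(advantages > 0, elbo, eubo)  # pick bound per sequence
    return -(advantages * bound).mean()              # negate: loss to minimize
```

Using one bound for all samples, as ELBO-only methods do, biases the gradient in a single direction; conditioning the bound on the advantage sign keeps each per-sample surrogate on the conservative side of the true objective.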
Community
tl;dr: We propose SPG, a new policy gradient algorithm for dLLMs that reduces policy-gradient bias by optimizing reward-dependent sandwiched variational bounds of the log-likelihood, and uses a block-wise masking technique to improve training efficiency and stability. SPG achieves state-of-the-art RL performance on mathematical and logical reasoning benchmarks.
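The block-wise masking idea can be pictured with the hedged sketch below: rather than masking tokens independently at random when estimating the likelihood bounds, contiguous blocks of the response are masked together. The block size and sampling scheme here are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def sample_blockwise_mask(seq_len: int, block_size: int,
                          mask_ratio: float) -> torch.Tensor:
    """Illustrative block-wise mask (assumed scheme, not SPG's exact one).

    Whole contiguous blocks of the sequence are masked together instead of
    independent per-token masking. Returns a boolean mask of shape
    (seq_len,), with True marking masked positions.
    """
    num_blocks = (seq_len + block_size - 1) // block_size
    num_masked = max(1, round(mask_ratio * num_blocks))
    chosen = torch.randperm(num_blocks)[:num_masked]   # blocks to mask
    mask = torch.zeros(seq_len, dtype=torch.bool)
    for b in chosen.tolist():
        mask[b * block_size:(b + 1) * block_size] = True
    return mask
```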
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization (2025)
- Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models (2025)
- d2: Improved Techniques for Training Reasoning Diffusion Language Models (2025)
- Principled and Tractable RL for Reasoning with Diffusion Language Models (2025)
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models (2025)
- DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning (2025)
- RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance (2025)