arxiv:2509.04419

Towards a Unified View of Large Language Model Post-Training

Published on Sep 4
Submitted by XingtaiHF on Sep 5
#3 Paper of the day
Abstract

AI-generated summary: A unified policy gradient estimator and a Hybrid Post-Training (HPT) algorithm combine online and offline data for post-training language models, improving performance across various benchmarks.

Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
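
Reading off the four component names above, one plausible way the estimator composes is sketched below; the notation is reconstructed here for illustration only, and the paper gives the exact definitions.

```latex
% One plausible composition of the four named parts, reconstructed from the
% abstract for illustration; the paper gives the precise definitions.
\nabla_\theta J(\theta) \;\approx\;
\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\!\left[
  \underbrace{\mathbb{1}_{\text{stable}}(x,y)}_{\text{stabilization mask}}
  \cdot
  \underbrace{\frac{1}{\pi_{\text{ref}}(y\mid x)}}_{\text{reference-policy denominator}}
  \cdot
  \underbrace{\hat{A}(x,y)}_{\text{advantage estimate}}
  \cdot
  \underbrace{\nabla_\theta\,\pi_\theta(y\mid x)}_{\text{likelihood gradient}}
\right]
```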

Community

Paper author Paper submitter

Do SFT and HPT use the same amount of computation and data in the benchmark results? Or is your finding that there shouldn't be significant differences given the same computation and data, because both are instances of the same optimization process?

Paper author

Thanks for your question and your interest in our work. HPT integrates SFT and RL, and we show that SFT and RL objectives can be optimized jointly within a single loss.
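
Conceptually, the per-prompt selection can be sketched like this (a simplified illustration only; the switching criterion, the `reward_threshold` parameter, and the `policy.log_prob` interface are placeholders here, and the exact rule and loss terms are given in the paper):

```python
def hybrid_post_training_loss(policy, prompt, demo, rollouts, rewards,
                              reward_threshold=0.5):
    """Sketch of a per-prompt SFT/RL switch (illustrative, not the exact HPT rule).

    policy:   model exposing log_prob(prompt, response) -> summed token log-probability
    demo:     offline demonstration for this prompt
    rollouts: online samples drawn from the current policy
    rewards:  scalar rewards for those rollouts
    """
    if max(rewards) < reward_threshold:
        # Rollouts look weak: exploit the demonstration with an SFT-style
        # negative log-likelihood term.
        return -policy.log_prob(prompt, demo)

    # Otherwise explore with a simple REINFORCE-style policy-gradient term,
    # using the mean rollout reward as a baseline for the advantage estimate.
    baseline = sum(rewards) / len(rewards)
    loss = 0.0
    for response, reward in zip(rollouts, rewards):
        loss = loss - (reward - baseline) * policy.log_prob(prompt, response)
    return loss / len(rollouts)
```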

As discussed in Section 3.3, while all algorithms share the same Common Objective, existing instantiations still make different bias-variance trade-offs in each component of the unified gradient estimator. Accordingly, we do not claim that there will be no significant differences across algorithms given the same compute and data; meaningful differences can certainly arise.

We follow the setup of our main baseline, LUFFY (arXiv:2504.14945): for SFT we train for 3 epochs on ~46k examples (≈138k example passes). For HPT, we run 500 optimization steps with a batch size of 128, totaling ≈64k example passes. Because HPT dynamically switches between SFT and RL, its training budget does not exceed that of the RL configuration (which also uses 500 steps). The apparent gap relative to the SFT setting is essentially the inherent RL-vs-SFT compute difference; given their distinct learning dynamics, it is not customary to compare their compute budgets head-to-head.
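
For reference, a quick back-of-the-envelope check of those budgets:

```python
# Quick check of the training budgets quoted above.
sft_example_passes = 3 * 46_000    # 3 epochs over ~46k examples -> ~138,000
hpt_example_passes = 500 * 128     # 500 steps * batch size 128  ->   64,000
print(sft_example_passes, hpt_example_passes)  # 138000 64000
```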

We hope this clarifies our intent and settings, and we’re happy to share more details or additional ablations if helpful.

