arxiv:2502.16944

Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

Published on Feb 24 · Submitted by keanudicap on Feb 28
Authors:

Abstract

Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and a critic, with a pretrained, fixed reward model providing guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show that DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
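
To make the decoupled setup concrete, below is a minimal sketch (in PyTorch, not the authors' code) of a policy update driven by a frozen global value model: the GVM supplies token-level value estimates, a simple advantage proxy is derived from them, and only the policy receives gradients, with no critic loss in the objective. The interfaces (`policy(input_ids).logits`, `gvm(input_ids, attention_mask)` returning per-token values) and the V(s_{t+1}) - V(s_t) advantage estimator are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a DVPO-style policy update with a frozen Global Value
# Model (GVM). Interfaces and the advantage estimator are assumptions for
# illustration; this is not the authors' implementation.
import torch
import torch.nn.functional as F

def dvpo_policy_step(policy, policy_old, gvm, optimizer,
                     input_ids, attention_mask, clip_eps=0.2):
    # Token-level value estimates from the frozen, pretrained GVM
    # and reference log-probabilities from the old policy.
    with torch.no_grad():
        values = gvm(input_ids, attention_mask)          # (B, T), assumed
        old_logits = policy_old(input_ids).logits        # (B, T, V), assumed

    logits = policy(input_ids).logits                    # (B, T, V)

    # Log-probabilities of the actually generated tokens (next-token shift).
    targets = input_ids[:, 1:]
    logp = torch.gather(F.log_softmax(logits[:, :-1], dim=-1), 2,
                        targets.unsqueeze(-1)).squeeze(-1)       # (B, T-1)
    old_logp = torch.gather(F.log_softmax(old_logits[:, :-1], dim=-1), 2,
                            targets.unsqueeze(-1)).squeeze(-1)   # (B, T-1)

    # Simple advantage proxy from the GVM's return-to-go estimates:
    # A_t ~= V(s_{t+1}) - V(s_t). The paper's estimator may differ.
    advantages = (values[:, 1:] - values[:, :-1]).detach()

    # PPO-style clipped surrogate, with no critic loss term because the
    # value model is frozen rather than jointly trained.
    ratio = torch.exp(logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    mask = attention_mask[:, 1:].float()
    loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the value model is never updated during policy training, the usual value-function loss and its optimizer state drop out of the loop, which is consistent with the memory and training-time savings reported in the abstract.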

Community

Paper author and submitter:

We tackle the computational and stability challenges of traditional PPO-based RLHF by introducing Decoupled Value Policy Optimization (DVPO). Our approach pretrains a Global Value Model (GVM) to predict token-level return-to-go values from policy trajectories, eliminating the need for joint actor-critic training while preserving fine-grained reward supervision. Theoretically, we show that without new reward feedback, pretraining a reward model and pretraining a GVM are equivalent. Experiments on benchmarks such as MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches state-of-the-art performance while reducing GPU memory usage and training time by approximately 40% and 35%, respectively.
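
For a concrete picture of the GVM pretraining stage described above, here is a hedged sketch that regresses token-level return-to-go targets with a masked MSE loss, assuming each trajectory carries a single sequence-level reward from a fixed reward model. The `gvm(input_ids, attention_mask)` interface, the discount factor `gamma`, and the `gamma**(T-1-t) * r` target are assumptions for illustration rather than the paper's exact recipe.

```python
# Hedged sketch of GVM pretraining on policy trajectories: fit token-level
# return-to-go targets derived from a single sequence-level reward produced
# by a fixed reward model. Names, signatures, and the RTG construction are
# illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def gvm_pretrain_step(gvm, optimizer, input_ids, attention_mask,
                      seq_rewards, gamma=1.0):
    B, T = input_ids.shape

    # With only a terminal reward r and no intermediate rewards, the
    # return-to-go at position t is RTG_t = gamma**(T-1-t) * r.
    steps_to_go = torch.arange(T - 1, -1, -1, device=input_ids.device,
                               dtype=torch.float32)               # (T,)
    rtg = seq_rewards.unsqueeze(1) * (gamma ** steps_to_go)       # (B, T)

    values = gvm(input_ids, attention_mask)                       # (B, T), assumed

    # Masked MSE regression onto the return-to-go targets.
    mask = attention_mask.float()
    loss = (F.mse_loss(values, rtg, reduction="none") * mask).sum() / mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```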
