Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
Abstract
Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires jointly training an actor and a critic, guided by a pretrained, fixed reward model. This design increases computational cost and instability due to actor-critic interdependence. Moreover, PPO lacks access to true environment rewards in LLM tasks, which limits its adaptability. Under such conditions, pretraining a value model is equivalent to pretraining a reward model, since both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM). The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling value estimation from policy training (the RL objective is driven by a frozen GVM), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show that DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
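To make the value-pretraining stage concrete, below is a minimal sketch assuming a PyTorch-style setup: a scalar value head on top of a language-model backbone is regressed onto token-level return-to-go targets computed from trajectory rewards. The names (`GlobalValueModel`, `return_to_go`, `gvm_pretraining_step`) and the masked MSE objective are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of GVM pretraining (assumption: PyTorch-style pseudo-implementation).
import torch
import torch.nn as nn

class GlobalValueModel(nn.Module):
    """Backbone plus a scalar value head; the backbone choice is an assumption."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone          # e.g., a HF-style transformer exposing last_hidden_state
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.value_head(hidden).squeeze(-1)   # (batch, seq_len) token-level values

def return_to_go(token_rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Discounted return-to-go targets R_t = sum_{k >= t} gamma^{k-t} r_k."""
    rtg = torch.zeros_like(token_rewards)
    running = torch.zeros(token_rewards.size(0), device=token_rewards.device)
    for t in reversed(range(token_rewards.size(1))):
        running = token_rewards[:, t] + gamma * running
        rtg[:, t] = running
    return rtg

def gvm_pretraining_step(gvm, batch, optimizer):
    """One regression step: fit predicted token values to return-to-go targets (masked MSE)."""
    targets = return_to_go(batch["token_rewards"])             # (batch, seq_len)
    preds = gvm(batch["input_ids"], batch["attention_mask"])   # (batch, seq_len)
    mask = batch["attention_mask"].float()
    loss = ((preds - targets) ** 2 * mask).sum() / mask.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because this regression uses only fixed trajectory data, the GVM can be trained once, frozen, and then reused for every subsequent policy update.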
Community
We tackle the computational and stability challenges of traditional PPO-based RLHF by introducing Decoupled Value Policy Optimization (DVPO). Our approach pretrains a Global Value Model (GVM) to predict token-level return-to-go values from policy trajectories, eliminating the need for joint actor-critic training while preserving fine-grained reward supervision. Theoretically, we show that without new reward feedback, pretraining a GVM is equivalent to pretraining a reward model. Experiments on benchmarks such as MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches state-of-the-art performance while reducing GPU memory usage and training time by approximately 40% and 35%, respectively.
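For the policy-optimization stage, here is a hedged sketch of how a frozen GVM can stand in for a jointly trained critic. The clipped-ratio objective and advantage normalization are standard PPO-style choices assumed for illustration; `policy.log_probs` and the batch fields are hypothetical helpers, and the exact DVPO objective may differ from this sketch.

```python
# Minimal sketch of a decoupled policy update guided by a frozen GVM
# (assumption: PPO-style clipped objective; not a verbatim reproduction of DVPO).
import torch

@torch.no_grad()
def gvm_token_values(gvm, input_ids, attention_mask):
    """The GVM is frozen: no gradients flow into it during policy training."""
    return gvm(input_ids, attention_mask)          # (batch, seq_len) return-to-go estimates

def policy_step(policy, gvm, batch, optimizer, clip_eps=0.2):
    values = gvm_token_values(gvm, batch["input_ids"], batch["attention_mask"])
    # Use the frozen token-level value estimates as the advantage signal
    # (batch normalization of advantages is an illustrative choice).
    adv = (values - values.mean()) / (values.std() + 1e-8)

    logp_new = policy.log_probs(batch["input_ids"], batch["attention_mask"])  # assumed helper
    ratio = torch.exp(logp_new - batch["logp_old"])                           # per-token ratios
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    mask = batch["response_mask"].float()
    loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point illustrated is the decoupling itself: only the policy's parameters receive gradients, so the memory and compute of critic updates disappear from the RL loop.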
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Simplify RLHF as Reward-Weighted SFT: A Variational Method (2025)
- Process Reinforcement through Implicit Rewards (2025)
- Reward Shaping to Mitigate Reward Hacking in RLHF (2025)
- Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models (2025)
- Improving LLM General Preference Alignment via Optimistic Online Mirror Descent (2025)
- IPO: Your Language Model is Secretly a Preference Classifier (2025)
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling (2025)