On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Abstract
CHORD integrates supervised fine-tuning and reinforcement learning by dynamically weighting off-policy and on-policy data to improve model stability and performance.
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
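To make the dual-control idea concrete, here is a minimal PyTorch sketch of how a global coefficient and a token-wise weight could blend an on-policy RL loss with an auxiliary SFT loss on off-policy expert tokens. The function name `chord_loss`, the specific p·(1−p) weighting form, and the annealing of `mu` are illustrative assumptions for this sketch, not the paper's exact implementation; the reference code is in the linked Trinity-RFT repository.

```python
# Illustrative sketch of CHORD-style dual control (assumptions noted below;
# not the authors' reference implementation).
import torch

def chord_loss(policy_loss: torch.Tensor,      # scalar on-policy RL loss (e.g. a GRPO/PPO objective)
               expert_logprobs: torch.Tensor,  # (batch, seq_len) log p_theta of expert tokens
               expert_mask: torch.Tensor,      # (batch, seq_len) 1.0 for valid expert tokens, 0.0 for padding
               mu: float) -> torch.Tensor:
    """Blend an on-policy RL loss with a token-weighted off-policy SFT loss."""
    probs = expert_logprobs.exp()
    # Token-wise weight: a simple p * (1 - p) bump that de-emphasizes tokens the
    # policy already predicts confidently (a hypothetical choice for this sketch);
    # treated as a constant weight here via detach().
    token_w = (probs * (1.0 - probs)).detach()
    # Weighted negative log-likelihood on expert tokens (the auxiliary SFT objective).
    sft_loss = -(token_w * expert_logprobs * expert_mask).sum() / expert_mask.sum().clamp(min=1.0)
    # Global coefficient mu: anneal from imitation (mu near 1) toward on-policy
    # exploration (mu near 0) over the course of training.
    return mu * sft_loss + (1.0 - mu) * policy_loss
```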
Community
Github: https://github.com/modelscope/Trinity-RFT
We welcome you to try our Trinity-RFT framework! The CHORD framework proposed here provides analytical insights into the challenges of combining SFT and RL from an off-policy versus on-policy perspective. We hope this work can serve as a catalyst for further discussion and inspire more exploration within the community!
The original Qwen 2.5 blog gives the 7B Instruct MMLU-Pro score as 56.3.
In this paper you state the original Qwen 2.5 7B Instruct MMLU-Pro score as 24.7, and then you state that with CHORD you managed to raise it to 56.2.
I'm confused.
Great catch! The difference comes down to the evaluation prompt.
The Qwen blog's 56.3 score was achieved with 5-shot CoT prompting. For our research, we deliberately used a zero-shot prompt template with tags (shown in the appendix), rather than benchmark-specific few-shot prompts.
Since MMLU-Pro contains diverse tasks (mathematics, physics, etc.), we wanted to observe the improvements brought by learning to reason through SFT and RL. The improvement from 24.7 to 56.2 in this setting demonstrates the effectiveness of our method.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning (2025)
- Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling (2025)
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning (2025)
- Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model (2025)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (2025)
- AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance (2025)
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification (2025)