arXiv:2508.11408

On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Published on Aug 15 · Submitted by xiaoniqiu on Aug 21
Abstract

AI-generated summary: CHORD integrates supervised fine-tuning and reinforcement learning by dynamically weighting off-policy and on-policy data to improve model stability and performance.

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
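For intuition only, here is a minimal sketch of what "SFT as a dynamically weighted auxiliary objective inside on-policy RL" could look like in code. This is not the authors' implementation (see the linked repository for that): the RL surrogate, the token-wise weighting function, and the schedule for the global coefficient `mu` below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def chord_style_loss(policy_logits_on, on_policy_targets, on_policy_advantages,
                     policy_logits_off, expert_tokens, mu):
    """Sketch of a dynamically weighted SFT + on-policy RL objective.

    mu: global coefficient in [0, 1]; assumed to start near 1 (off-policy
    imitation) and be annealed toward 0 (on-policy exploration) over training.
    """
    # On-policy RL term: a simple policy-gradient surrogate stands in for the
    # paper's actual on-policy RL objective.
    log_probs_on = F.log_softmax(policy_logits_on, dim=-1)
    chosen_on = log_probs_on.gather(-1, on_policy_targets.unsqueeze(-1)).squeeze(-1)
    rl_loss = -(on_policy_advantages * chosen_on).mean()

    # Off-policy SFT term on expert tokens, with a token-wise weight.
    log_probs_off = F.log_softmax(policy_logits_off, dim=-1)
    chosen_off = log_probs_off.gather(-1, expert_tokens.unsqueeze(-1)).squeeze(-1)
    p = chosen_off.exp()
    # Placeholder token-wise weighting that emphasizes tokens the policy is
    # uncertain about; the paper defines its own weighting function.
    token_weight = (p * (1.0 - p)).detach()
    sft_loss = -(token_weight * chosen_off).mean()

    # Global coefficient blends the two objectives in a single update.
    return mu * sft_loss + (1.0 - mu) * rl_loss
```

In this reading, `mu` is high early so the policy first imitates the off-policy expert data, then decays so that on-policy exploration dominates, while the token-wise weight limits how strongly individual expert tokens can disrupt the on-policy distribution.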

Community

Paper author and submitter:

Github: https://github.com/modelscope/Trinity-RFT

You're welcome to try our Trinity-RFT framework! The CHORD framework proposed here provides analytical insights into the challenges of combining SFT and RL from an off-policy vs. on-policy perspective. We hope this work serves as a catalyst for further discussion and inspires more exploration within the community!

The original Qwen 2.5 blog gives the 7B Instruct MMLU-Pro score as 56.3.

In this paper you state the original Qwen 2.5 7B Instruct MMLU-Pro score as 24.7, and then you state that with CHORD you managed to raise it to 56.2.

I'm confused.


Great catch! The difference comes down to the evaluation prompt.
The Qwen blog's 56.3 score was achieved with 5-shot CoT prompting. For our research, we deliberately used a zero-shot prompt template with tags (shown in the appendix), rather than benchmark-specific few-shot prompts.
Since MMLU-Pro contains diverse tasks (mathematics, physics, etc.), we wanted to observe the improvements brought by learning to reason through SFT and RL. The improvement from 24.7 to 56.2 in this setting demonstrates the effectiveness of our method.
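For illustration, a zero-shot prompt along these lines could be built as in the sketch below. The exact template and tags used in the paper are given in its appendix, so the wording, tag names, and helper here are hypothetical.

```python
# Hypothetical zero-shot MMLU-Pro prompt builder; the actual template used in
# the paper is shown in its appendix, so this wording is an assumption.
def build_zero_shot_prompt(question: str, choices: list[str]) -> str:
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        "Answer the following multiple-choice question. Reason step by step "
        "inside <think>...</think>, then give only the option letter inside "
        "<answer>...</answer>.\n\n"
        f"Question: {question}\n{options}"
    )
```

Unlike a 5-shot CoT setup, no worked examples are prepended, so the score reflects what the model has learned to do on its own rather than how well it follows in-context demonstrations.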
