arxiv:2508.02150

Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Published on Aug 4 · Submitted by Abbey4799 on Aug 5

Abstract

AI-generated summary: A self-supervised RL framework enhances instruction following in reasoning models without external supervision, maintaining reasoning performance and offering scalability and cost-effectiveness.

Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.

Community

Paper submitter

We introduce Self-Supervised Reinforcement Learning for Instruction Following, a framework that achieves superior instruction-following capabilities in reasoning models through reinforcement learning, eliminating the need for costly annotations or dependence on stronger external models.

Our framework incorporates three key innovations: (1) an incremental constraint curriculum that decomposes multi-constraint instructions into simpler instructions with an incrementally increasing number of constraints, enabling more stable RL training through denser reward signals; (2) a novel soft constraint reward modeling approach that establishes semantic understanding of constraints without requiring external supervision; and (3) an efficient constraint-wise binary classification mechanism that enables scalable reward computation during RL training.
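To make the constraint-wise reward in (3) concrete, here is a minimal sketch, not the paper's implementation: the judge function is a hypothetical stand-in for the soft constraint reward model (the toy version below just checks keywords, whereas the paper derives the signal from the model itself), and averaging the per-constraint binary outcomes into a scalar reward is an assumption; the released code in the linked repository is authoritative.

from typing import Callable, List

# Hypothetical stand-in for a per-constraint judge: given a response and a
# single constraint description, return True if the constraint is satisfied.
JudgeFn = Callable[[str, str], bool]


def constraint_wise_reward(response: str, constraints: List[str], judge: JudgeFn) -> float:
    """Check each constraint independently (binary classification per
    constraint) and return the fraction satisfied as the scalar reward."""
    if not constraints:
        return 1.0
    satisfied = [judge(response, c) for c in constraints]
    return sum(satisfied) / len(satisfied)


def keyword_judge(response: str, constraint: str) -> bool:
    """Toy rule-based judge for illustration only: treats the constraint
    string as a required keyword in the response."""
    return constraint.lower() in response.lower()


if __name__ == "__main__":
    constraints = ["bullet", "summary"]
    response = "Here is a bullet-point summary of the findings."
    print(constraint_wise_reward(response, constraints, keyword_judge))  # 1.0

Scoring each constraint separately rather than the whole instruction at once is what makes the reward both cheap to compute and dense enough to drive RL updates.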

Experimental results demonstrate that our trained models consistently outperform baseline approaches across comprehensive instruction-following benchmarks while preserving superior reasoning capabilities. Analysis of training dynamics reveals that our incremental constraint curriculum generates denser reward signals during optimization compared to direct multi-constraint training methods. Furthermore, we demonstrate the critical importance of incorporating instruction-following-specific reasoning data during the cold-start phase, rather than relying exclusively on reasoning models saturated with mathematical and logical tasks. Code and data are publicly available at https://github.com/Rainier-rq/verl-if.
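As a rough picture of how the incremental constraint curriculum could expand one multi-constraint instruction into progressively harder training examples, here is a minimal sketch under assumed conventions: the prompt template, the prefix-based constraint ordering, and the "num_constraints" field are illustrative choices, not the paper's exact data pipeline.

from typing import Dict, List


def build_incremental_curriculum(base_instruction: str, constraints: List[str]) -> List[Dict]:
    """Expand one multi-constraint instruction into examples with
    1, 2, ..., N constraints, so early RL steps see easier instructions
    and obtain denser reward signals than direct multi-constraint training."""
    examples = []
    for k in range(1, len(constraints) + 1):
        subset = constraints[:k]
        prompt = base_instruction + "\nConstraints:\n" + "\n".join(f"- {c}" for c in subset)
        examples.append({"prompt": prompt, "constraints": subset, "num_constraints": k})
    return examples


if __name__ == "__main__":
    curriculum = build_incremental_curriculum(
        "Write a product announcement.",
        [
            "Use at most 100 words.",
            "Include exactly three bullet points.",
            "End with a call to action.",
        ],
    )
    for example in curriculum:
        print(example["num_constraints"], example["constraints"])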

