arxiv:2508.02150

Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Published on Aug 4 · Submitted by Abbey4799 on Aug 5

Abstract

AI-generated summary: A self-supervised RL framework enhances instruction following in reasoning models without external supervision, maintaining reasoning performance and offering scalability and cost-effectiveness.

Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.

Community

Paper submitter

We introduce Self-Supervised Reinforcement Learning for Instruction Following, a framework that achieves superior instruction-following capabilities in reasoning models through reinforcement learning, eliminating the need for costly annotations or dependence on stronger external models.

Our framework incorporates three key innovations: (1) an incremental constraint curriculum that decomposes multi-constraint instructions into simpler instructions with an incrementally increasing number of constraints, enabling more stable RL training through denser reward signals; (2) a novel soft constraint reward modeling approach that establishes semantic understanding of constraints without requiring external supervision; and (3) an efficient constraint-wise binary classification mechanism that enables scalable reward computation during RL training.
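To make the constraint-wise reward in (3) concrete, here is a minimal sketch, not the paper's implementation: the judge function is a hypothetical stand-in for the soft constraint reward model (the toy version below just checks keywords, whereas the paper derives the signal from the model itself), and averaging the per-constraint binary outcomes into a scalar reward is an assumption; the released code in the linked repository is authoritative.

from typing import Callable, List

# Hypothetical stand-in for a per-constraint judge: given a response and a
# single constraint description, return True if the constraint is satisfied.
JudgeFn = Callable[[str, str], bool]


def constraint_wise_reward(response: str, constraints: List[str], judge: JudgeFn) -> float:
    """Check each constraint independently (binary classification per
    constraint) and return the fraction satisfied as the scalar reward."""
    if not constraints:
        return 1.0
    satisfied = [judge(response, c) for c in constraints]
    return sum(satisfied) / len(satisfied)


def keyword_judge(response: str, constraint: str) -> bool:
    """Toy rule-based judge for illustration only: treats the constraint
    string as a required keyword in the response."""
    return constraint.lower() in response.lower()


if __name__ == "__main__":
    constraints = ["bullet", "summary"]
    response = "Here is a bullet-point summary of the findings."
    print(constraint_wise_reward(response, constraints, keyword_judge))  # 1.0

Scoring each constraint separately rather than the whole instruction at once is what makes the reward both cheap to compute and dense enough to drive RL updates.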

Experimental results demonstrate that our trained models consistently outperform baseline approaches across comprehensive instruction-following benchmarks while preserving superior reasoning capabilities. Analysis of training dynamics reveals that our incremental constraint curriculum generates denser reward signals during optimization compared to direct multi-constraint training methods. Furthermore, we demonstrate the critical importance of incorporating instruction-following-specific reasoning data during the cold-start phase, rather than relying exclusively on reasoning models saturated with mathematical and logical tasks. Code and data are publicly available at https://github.com/Rainier-rq/verl-if.
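As a rough picture of how the incremental constraint curriculum could expand one multi-constraint instruction into progressively harder training examples, here is a minimal sketch under assumed conventions: the prompt template, the prefix-based constraint ordering, and the "num_constraints" field are illustrative choices, not the paper's exact data pipeline.

from typing import Dict, List


def build_incremental_curriculum(base_instruction: str, constraints: List[str]) -> List[Dict]:
    """Expand one multi-constraint instruction into examples with
    1, 2, ..., N constraints, so early RL steps see easier instructions
    and obtain denser reward signals than direct multi-constraint training."""
    examples = []
    for k in range(1, len(constraints) + 1):
        subset = constraints[:k]
        prompt = base_instruction + "\nConstraints:\n" + "\n".join(f"- {c}" for c in subset)
        examples.append({"prompt": prompt, "constraints": subset, "num_constraints": k})
    return examples


if __name__ == "__main__":
    curriculum = build_incremental_curriculum(
        "Write a product announcement.",
        [
            "Use at most 100 words.",
            "Include exactly three bullet points.",
            "End with a call to action.",
        ],
    )
    for example in curriculum:
        print(example["num_constraints"], example["constraints"])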

