Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
Abstract
A self-supervised RL framework enhances instruction following in reasoning models without external supervision, maintaining reasoning performance while remaining scalable and cost-effective.
Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capability and instruction following. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased cost and restricted accessibility. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
Community
We introduce Self-Supervised Reinforcement Learning for Instruction Following, a framework that improves the instruction-following capabilities of reasoning models through reinforcement learning, without costly annotations or dependence on stronger external models.
Our framework incorporates three key innovations: (1) an incremental constraint curriculum that decomposes multi-constraint instructions into simpler instructions whose constraint count grows step by step, providing denser reward signals for more stable RL training; (2) a soft constraint reward model that evaluates constraint satisfaction semantically, without requiring external supervision; and (3) an efficient constraint-wise binary classification mechanism that makes reward computation scalable during RL training. A minimal sketch of the curriculum and reward ideas follows below.
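To make the curriculum and reward ideas concrete, here is a minimal Python sketch. It is illustrative only: the rule-based checks stand in for the paper's soft, semantically judged constraint rewards, and all names (`build_incremental_curriculum`, `constraint_wise_reward`, the toy constraints) are our own assumptions rather than the authors' implementation.

```python
from typing import Callable, List

# A constraint verifier returns True if the response satisfies the constraint.
# (Assumption: the paper judges constraints semantically with the model itself;
# here simple rule-based checks are used so the sketch is self-contained.)
Constraint = Callable[[str], bool]


def build_incremental_curriculum(prompt: str, constraints: List[Constraint]) -> List[dict]:
    """Expand one multi-constraint instruction into training instances whose
    constraint count grows by one at each stage (incremental curriculum)."""
    return [
        {"prompt": prompt, "constraints": constraints[:k]}
        for k in range(1, len(constraints) + 1)
    ]


def constraint_wise_reward(response: str, constraints: List[Constraint]) -> float:
    """Score a response as the fraction of constraints it satisfies, treating
    each constraint as an independent binary classification."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for check in constraints if check(response))
    return satisfied / len(constraints)


# Toy usage with two verifiable constraints on one instruction.
constraints = [
    lambda r: len(r.split()) <= 50,     # word-count limit
    lambda r: r.strip().endswith("."),  # must end with a period
]
stages = build_incremental_curriculum("Summarize the paper in one sentence.", constraints)
reward = constraint_wise_reward("A short summary.", stages[-1]["constraints"])
print(f"curriculum stages: {len(stages)}, reward on final stage: {reward:.2f}")
```

Because partial constraint satisfaction still earns partial reward at every curriculum stage, the policy receives denser feedback than it would from a single all-or-nothing check on the full multi-constraint instruction.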
Experimental results demonstrate that our trained models consistently outperform baseline approaches across comprehensive instruction-following benchmarks while preserving strong reasoning capabilities. Analysis of training dynamics shows that the incremental constraint curriculum yields denser reward signals during optimization than direct multi-constraint training. Furthermore, we demonstrate the importance of including instruction-following-specific reasoning data in the cold-start phase, rather than relying exclusively on reasoning models saturated with mathematical and logical tasks. Code and data are publicly available at https://github.com/Rainier-rq/verl-if.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VerIF: Verification Engineering for Reinforcement Learning in Instruction Following (2025)
- Libra: Assessing and Improving Reward Model by Learning to Think (2025)
- Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards (2025)
- The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs (2025)
- From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs (2025)
- Multimodal Mathematical Reasoning with Diverse Solving Perspective (2025)
- ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models (2025)
arXiv explained breakdown of this paper: https://arxivexplained.com/papers/beyond-the-trade-off-self-supervised-reinforcement-learning-for-reasoning-models-instruction-following