Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals
Abstract
GAIN-RL leverages angle concentration signals to improve training efficiency and data efficiency in Reinforcement Fine-tuning of Large Language Models.
Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from sample inefficiency due to the redundant exposure of identical queries under uniform data sampling. While previous work has explored curriculum learning via heuristic difficulty metrics, these strategies neglect the intrinsic learning signals generated by the model itself, leading to suboptimal training regimes. In this paper, we identify a model-inherent signal, termed angle concentration, that effectively reflects an LLM's capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden-state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL (GRPO)'s efficient sampling yields data-efficient training, achieving better performance with half the original data than vanilla GRPO achieves with the full training set. Code is released at https://github.com/wangqinsi1/GAINRL/tree/main.
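To make the core signal concrete, here is a minimal sketch, not the authors' implementation (see the repository above for the official code). It scores a prompt by the mean pairwise cosine similarity of its last-layer token hidden states, one plausible proxy for angle concentration; the paper's exact definition may differ, and the model name is only an example.

```python
# Illustrative sketch, not the official GAIN-RL code: approximate a prompt's
# "angle concentration" as the mean pairwise cosine similarity of its
# last-layer token hidden-state vectors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # example model, not necessarily the one used in the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def angle_concentration(hidden: torch.Tensor) -> float:
    """hidden: (num_tokens, hidden_dim) hidden states of a single prompt."""
    unit = torch.nn.functional.normalize(hidden.float(), dim=-1)  # unit-norm token vectors
    cos = unit @ unit.T                                           # pairwise cosine matrix
    n = cos.shape[0]
    off_diag = cos.sum() - cos.diagonal().sum()                   # drop self-similarity terms
    return (off_diag / (n * (n - 1))).item()                      # mean over distinct token pairs

def score_prompt(prompt: str) -> float:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return angle_concentration(out.hidden_states[-1][0])          # last layer, first batch item
```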
Community
In this paper, we show that the angle concentration of hidden-state vectors is an intrinsic indicator of how much an LLM can learn from a sample, correlating tightly with gradient strength. Leveraging this signal, GAIN-RL dynamically selects the most informative examples each epoch, keeping gradient updates impactful and cutting the sample waste that plagues standard RFT. On diverse math and coding benchmarks, and at multiple model scales, GAIN-RL delivers >2.5× faster training and beats vanilla GRPO with just half the original data. A sketch of the epoch-wise selection idea follows below.
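As a complement to the scoring sketch under the abstract, the following is a minimal, hypothetical sketch of epoch-wise selection, assuming each example already has an angle-concentration score. Names such as `select_top_k` and `grpo_update` are placeholders, not part of the GAIN-RL codebase.

```python
# Hypothetical sketch of epoch-wise, angle-informed data selection; the actual
# GAIN-RL selection and update rules live in the linked repository.
from typing import Callable, Dict, List

def select_top_k(dataset: List[Dict], scores: List[float], k: int) -> List[Dict]:
    """Keep the k examples with the highest angle-concentration score for this epoch."""
    ranked = sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)
    return [dataset[i] for i in ranked[:k]]

def train(dataset: List[Dict],
          score_prompt: Callable[[str], float],
          grpo_update: Callable[[List[Dict]], None],
          num_epochs: int) -> None:
    for _ in range(num_epochs):
        # Re-score every epoch so selection tracks the model's current state.
        scores = [score_prompt(ex["prompt"]) for ex in dataset]
        subset = select_top_k(dataset, scores, k=max(1, len(dataset) // 2))
        grpo_update(subset)  # placeholder for the actual RFT/GRPO training step
```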
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Behavior Injection: Preparing Language Models for Reinforcement Learning (2025)
- Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs (2025)
- Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling (2025)
- On-Policy RL with Optimal Reward Baseline (2025)
- Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs (2025)
- Synthetic Data RL: Task Definition Is All You Need (2025)
- RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning (2025)