SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
Abstract
Stable rank, an intrinsic quality signal derived from model representations, improves LLM alignment with human preferences through reinforcement learning without external supervision.
Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
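For intuition, a minimal sketch of the stable-rank score described in the abstract is given below. It assumes the metric is computed over a token-by-hidden-dimension matrix of hidden states from a single layer for one response; the layer choice, the absence of mean-centering, and the Best-of-N selection rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a stable-rank quality score.
# Assumption: hidden_states is a (num_tokens, hidden_dim) matrix from one
# layer of the model for a single response; layer choice, pooling, and
# centering may differ in the paper.
import torch

def stable_rank(hidden_states: torch.Tensor, eps: float = 1e-8) -> float:
    """Ratio of total variance to dominant-direction variance.

    Numerically this is ||H||_F^2 / ||H||_2^2: the sum of squared singular
    values divided by the largest squared singular value.
    """
    singular_values = torch.linalg.svdvals(hidden_states.float())
    total = (singular_values ** 2).sum()
    dominant = singular_values[0] ** 2  # svdvals returns descending order
    return (total / (dominant + eps)).item()

def best_of_n(hidden_state_sets) -> int:
    """Pick the candidate whose hidden states have the highest stable rank
    (one possible reading of the Best-of-N setup in the abstract)."""
    scores = [stable_rank(h) for h in hidden_state_sets]
    return max(range(len(scores)), key=scores.__getitem__)
```

Here higher stable rank is treated as the preferred response, matching the abstract's claim that the score tracks output quality; the exact selection and aggregation details are not specified in this summary.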
Community
🤯 We know RLHF relies heavily on external rewards. But what if the model already knows when it's reasoning well?
The paper SR-GRPO introduces a simple intrinsic metric, stable rank, which serves as an annotation-free quality signal for LLM outputs. It measures the effective dimensionality of the hidden states, i.e. how richly information is spread across representation directions.
It is like checking the complexity of the model's brain signal to see if it is reasoning clearly.
The result: SR-GRPO lifts Qwen2.5-1.5B-Instruct by 19% on mathematical reasoning and 10% on STEM, with no expensive human preference data in the loop.
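For anyone curious how an intrinsic score like this could drive the RL step, here is a rough sketch of group-relative reward normalization in general (my reading, not the paper's code): score each sampled response in a group with stable rank, then standardize within the group to get advantages.

```python
# Rough sketch, not the authors' implementation: turn per-response stable-rank
# scores into GRPO-style group-normalized advantages.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within a group of responses to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: stable-rank scores for 4 sampled responses to one prompt.
rewards = torch.tensor([3.1, 4.7, 2.9, 5.2])
advantages = group_relative_advantages(rewards)
# Responses with above-average stable rank get positive advantages and are
# reinforced by the policy-gradient update; below-average ones are discouraged.
```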
Just look inside! 🧐
Paper: SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs (2025)
- PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling (2025)
- OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment (2025)
- Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment (2025)
- From to : Multidimensional Supervision of Reasoning Process for LLM Optimization (2025)
- Reinforced Preference Optimization for Recommendation (2025)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning (2025)