Rethinking Reward Models for Multi-Domain Test-Time Scaling
Abstract
A unified evaluation across 14 diverse domains shows that generative outcome reward models outperform process reward models and discriminative outcome reward models at verifying large language model outputs during test-time scaling.
The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs), which assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and struggles to evaluate long reasoning trajectories, including those involving self-correction. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. To facilitate future research in multi-domain settings, we publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM.
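To make the ORM-versus-PRM contrast concrete, the sketch below simulates best-of-N answer selection under the two scoring schemes. This is our own toy illustration, not code from the released repository: the noise rates (`EPS`, `P_GOOD_CORRECT`, `P_GOOD_INCORRECT`), the min-based step aggregation, and the 0.2 penalty score are assumed parameters chosen only to expose the trend the abstract describes.

```python
"""Toy simulation (an illustration, not the paper's code): best-of-N selection
accuracy when a verifier scores only the final answer (ORM-style) versus when
it aggregates noisy per-step labels with a minimum (a common PRM-style
aggregation). All rates are made-up parameters."""
import random

EPS = 0.10               # assumed: chance a single outcome judgment is flipped
P_GOOD_CORRECT = 0.90    # assumed: P(step labeled "good" | solution correct)
P_GOOD_INCORRECT = 0.60  # assumed: P(step labeled "good" | solution incorrect)

def orm_score(correct: bool) -> float:
    # One noisy binary read of final-answer correctness.
    return float(correct ^ (random.random() < EPS))

def prm_score(correct: bool, num_steps: int) -> float:
    # Min over noisy per-step labels: one spurious "bad" label drops the score,
    # so the chance of a drop grows with trajectory length.
    p_good = P_GOOD_CORRECT if correct else P_GOOD_INCORRECT
    return min(1.0 if random.random() < p_good else 0.2
               for _ in range(num_steps))

def best_of_n_accuracy(score_fn, n=8, p_correct=0.3, trials=5000) -> float:
    hits = 0
    for _ in range(trials):
        labels = [random.random() < p_correct for _ in range(n)]
        labels[random.randrange(n)] = True  # guarantee one correct candidate
        scores = [score_fn(c) for c in labels]
        # Argmax with random tie-breaking, since min-aggregated scores tie often.
        pick = max(range(n), key=lambda i: (scores[i], random.random()))
        hits += labels[pick]
    return hits / trials

if __name__ == "__main__":
    print(f"ORM         : {best_of_n_accuracy(orm_score):.3f}")
    for t in (4, 16, 64):
        fn = lambda c, t=t: prm_score(c, t)
        print(f"PRM (T={t:>2}) : {best_of_n_accuracy(fn):.3f}")
```

Under these assumptions, the outcome-style verifier's selection accuracy is independent of trajectory length, while the step-aggregated scores collapse toward the floor as the number of steps grows: a single spuriously "bad" step label is enough to sink a correct solution, so selection drifts toward chance. This mirrors, in miniature, the compounding effect the abstract attributes to PRM-style scoring.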
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling (2025)
- Trust but Verify! A Survey on Verification Design for Test-time Scaling (2025)
- Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS (2025)
- Logical Reasoning with Outcome Reward Models for Test-Time Scaling (2025)
- Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned (2025)
- From Faithfulness to Correctness: Generative Reward Models that Think Critically (2025)
- Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks (2025)