Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput
Abstract
The standard paradigm for solving coding tasks with large language models (LLMs) is to generate-then-rank candidate programs, where the ranking step relies on a verifier. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be prioritized over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We aim to challenge this assumption by systematically exploring the trade-off between speed and accuracy. We find that ORMs play a crucial role in scaling verification by trading accuracy for speed, even when a comprehensive verifier is available. Their value becomes especially apparent in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking -- leading to a system that is 11.65x faster while only 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.
Community
TL;DR: Outcome reward models for code verification let one trade accuracy for speed in the generate-then-rank paradigm. This trade-off can be significantly improved through a generate-prune-then-rank approach: a weaker verifier eliminates candidates before ranking with an outcome reward model, avoiding wasted compute on incorrect candidates. We show that this hybrid approach can be 11.65 times faster than running the whole test suite while being only 8.33% less accurate.
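For intuition, here is a minimal Python sketch of the generate-prune-then-rank pipeline described above. The helper names (`generate`, `cheap_verify`, `orm_score`) are hypothetical placeholders standing in for an LLM sampler, a fast but imperfect verifier (e.g., a small subset of tests), and an outcome reward model; they are not the paper's actual implementation.

```python
# Sketch of generate-prune-then-rank (hypothetical helper names):
# 1) generate N candidate programs,
# 2) prune with a cheap, imperfect verifier,
# 3) rank the survivors with an outcome reward model (ORM).

from typing import Callable, List

def generate_prune_then_rank(
    generate: Callable[[str, int], List[str]],  # LLM sampler: (prompt, n) -> candidate programs
    cheap_verify: Callable[[str], bool],        # fast, imperfect verifier (e.g., a few public tests)
    orm_score: Callable[[str], float],          # ORM score; higher means more likely correct
    prompt: str,
    n_samples: int = 32,
) -> str:
    candidates = generate(prompt, n_samples)

    # Prune: drop candidates the cheap verifier already rejects,
    # so the more expensive ORM never has to score them.
    survivors = [c for c in candidates if cheap_verify(c)]

    # Fall back to ranking all candidates if the cheap verifier rejects everything.
    pool = survivors or candidates

    # Rank the remaining candidates with the ORM and return the top one.
    return max(pool, key=orm_score)
```

The design choice mirrored here is the trade-off discussed in the paper: the cheap verifier does the bulk of the filtering quickly, and the ORM only ranks the (much smaller) surviving pool, which is where the reported speedup over running the full test suite on every candidate comes from.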
Related papers recommended by the Semantic Scholar API:
- Learning to Reason Across Parallel Samples for LLM Reasoning (2025)
- Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence (2025)
- HardTests: Synthesizing High-Quality Test Cases for LLM Coding (2025)
- Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier (2025)
- Improving Assembly Code Performance with Large Language Models via Reinforcement Learning (2025)
- Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling (2025)
- Incentivizing LLMs to Self-Verify Their Answers (2025)