Abstract
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.
Community
TL;DR: We tackle the expensive step-level supervision needed to train PRMs with ThinkPRM, a generative PRM fine-tuned with only 8K process labels, enabling it to verify reasoning by generating long chains-of-thought. A minimal usage sketch follows the links below.
Github: https://github.com/mukhal/thinkprm
Our trained verifiers: ThinkPRM-14B, ThinkPRM-1.5B
1K synthetic verification CoTs used for verifier training: https://huggingface.co/datasets/launch/thinkprm-1K-verification-cots
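To make the verification interface concrete, here is a minimal sketch of best-of-N selection with a verbalized PRM. The prompt format, the "correct"/"incorrect" verdict parsing, and the fraction-based score aggregation are illustrative assumptions, not ThinkPRM's exact protocol (see the repo for the real prompts and scoring):

```python
import re
from typing import Callable, List

def score_solution(verify: Callable[[str], str], problem: str, steps: List[str]) -> float:
    """Ask the verifier for a verification CoT, then aggregate step verdicts.

    Assumes (hypothetically) that the verifier labels each step 'correct' or
    'incorrect' somewhere in its chain-of-thought; returns the fraction of
    steps judged correct.
    """
    prompt = (
        f"Problem: {problem}\n"
        + "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
        + "\nVerify each step, then label it correct or incorrect."
    )
    cot = verify(prompt)  # long verification chain-of-thought
    verdicts = re.findall(r"\b(correct|incorrect)\b", cot.lower())
    return verdicts.count("correct") / len(verdicts) if verdicts else 0.0

def best_of_n(verify: Callable[[str], str], problem: str,
              candidates: List[List[str]]) -> List[str]:
    """Pick the candidate solution whose steps the verifier rates highest."""
    return max(candidates, key=lambda steps: score_solution(verify, problem, steps))
```

Scoring by the fraction of steps judged correct is just one plausible choice; taking the minimum over step verdicts, or the verifier's probability of a final "correct" token, are common alternatives.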
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2025)
- R-PRM: Reasoning-Driven Process Reward Modeling (2025)
- Adaptive Rectification Sampling for Test-Time Compute Scaling (2025)
- Better Process Supervision with Bi-directional Rewarding Signals (2025)
- When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning (2025)
- Efficient Process Reward Model Training via Active Learning (2025)
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods (2025)
@mkhalifa Really impressive work!!
I noticed you’ve fine-tuned 1.5B and 14B models. Have you explored (or considered) training the 7B model as a middle ground in terms of capability and compute efficiency? I imagine it could be a practical sweet spot for many real-world applications.
Also, I was curious if you’ve considered applying rejection sampling or filtering using ThinkPRM on a larger synthetic dataset (beyond the 1K verification CoTs used for training)? Could be interesting to see if this self-bootstrapping approach helps scale up PRM quality without needing more process-labeled data.
@Ritvik19 thank you for your interest!
We did fine-tune the 7B on the same 1K CoTs, but surprisingly its performance on ProcessBench wasn't much better than the 1.5B model's, so we decided not to include it in the paper. Based on your suggestion and other comments we've received, we will release the 7B model soon.
This is a cool idea, and our training can already be thought of as a single round of sample + filter + train. Further iterations, where filtering is done with the verifier trained in the previous round, could boost performance even more, but we haven't tried that due to compute constraints. A rough sketch of what that loop would look like is below.
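For concreteness, here's a minimal sketch of that iterative bootstrap. `sample_cots`, `matches_gold`, `self_filter`, and `finetune` are hypothetical placeholders (sampling verification CoTs, checking step verdicts against gold labels, self-filtering with the current verifier, and fine-tuning on the survivors), not APIs from the actual codebase:

```python
from typing import Callable, List, Tuple

def bootstrap_verifier(
    verifier,
    labeled: List[Tuple[str, List[bool]]],  # (problem, gold step labels)
    unlabeled: List[str],                   # problems without process labels
    sample_cots: Callable,    # (verifier, problem) -> list of verification CoTs
    matches_gold: Callable,   # (cot, gold step labels) -> bool
    self_filter: Callable,    # (verifier, problem, cot) -> bool
    finetune: Callable,       # (verifier, [(problem, cot)]) -> new verifier
    rounds: int = 3,
):
    """Iterated sample -> filter -> train.

    Round 1 mirrors the single pass used to train ThinkPRM: sample CoTs and
    keep those whose step verdicts agree with gold labels. Later rounds expand
    to unlabeled problems, filtering with the verifier from the previous round
    instead of gold labels, so no extra process supervision is required.
    """
    for r in range(rounds):
        data: List[Tuple[str, str]] = []
        if r == 0:
            for p, gold in labeled:
                data += [(p, c) for c in sample_cots(verifier, p)
                         if matches_gold(c, gold)]
        else:
            for p in unlabeled:
                data += [(p, c) for c in sample_cots(verifier, p)
                         if self_filter(verifier, p, c)]
        verifier = finetune(verifier, data)
    return verifier
```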
Let me know if you have any more questions!