arxiv:2503.22233

Process Reward Modeling with Entropy-Driven Uncertainty

Published on Mar 28

Abstract

This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a framework that approaches state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism: during token generation, it uses the entropy of the logit distribution to locate high-uncertainty regions and places step boundaries there. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. In experiments on the Qwen2.5-72B model, training with only 7,500 EDU-PRM-generated queries yields accuracy close to that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), a 98% reduction in query cost compared to prior methods. This work establishes EDU-PRM as an efficient approach for scalable process reward model training.
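
The abstract does not spell out the partitioning rule, so the following is only a minimal sketch of the general idea, under stated assumptions: each generated token comes with its logit vector, the entropy of the softmax distribution is the uncertainty signal, and a new step begins whenever that entropy exceeds a fixed threshold. The names `token_entropy` and `split_steps` and the `threshold` parameter are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_steps(
    tokens: Sequence[str],
    logits_per_token: Sequence[Sequence[float]],
    threshold: float,
) -> List[List[str]]:
    """Start a new step at every token whose entropy exceeds `threshold`.

    Assumption: a fixed threshold marks a step boundary; the paper's
    actual criterion may differ.
    """
    steps: List[List[str]] = []
    current: List[str] = []
    for tok, logits in zip(tokens, logits_per_token):
        # A high-entropy token opens a new step, unless no step has started yet.
        if current and token_entropy(logits) > threshold:
            steps.append(current)
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps


# Toy example: peaked logits mean a confident token, flat logits an uncertain one.
peaked = [10.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0]  # entropy = ln(3), about 1.0986 nats
tokens = ["2", "+", "3", "=", "5"]
logits = [peaked, peaked, flat, peaked, flat]
steps = split_steps(tokens, logits, threshold=1.0)
print(steps)
```

With these toy inputs, the two flat distributions fall on "3" and "5", so the sequence is cut into three steps. In a real pipeline the logits would come from the generating model, and the threshold would need tuning.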
