ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Abstract
A benchmark and a training framework for reward models in tool-calling scenarios improve downstream tool-use performance and enable data-efficient fine-tuning through outcome-based evaluation.
As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.
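As a rough illustration of how an outcome reward model can be used at inference time, the sketch below performs Best-of-n reranking over candidate tool calls. It assumes a scalar-head (sequence-classification) reward model and a simple prompt format; the checkpoint path, formatting, and scoring interface are illustrative assumptions, not the paper's released setup.

```python
# Minimal Best-of-n reranking sketch with an outcome reward model.
# Checkpoint path and prompt format are placeholders, not ToolRM's exact interface.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "path/to/tool-calling-reward-model"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)
reward_model.eval()

def score(user_query: str, tool_call: str) -> float:
    """Return a scalar reward for a candidate tool call given the user query."""
    text = f"User: {user_query}\nAssistant tool call: {tool_call}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()

def best_of_n(user_query: str, candidates: list[str]) -> str:
    """Pick the candidate tool call with the highest reward score."""
    return max(candidates, key=lambda c: score(user_query, c))

candidates = [
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "kelvin"}}',
]
print(best_of_n("What's the weather in Paris in Celsius?", candidates))
```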
Community
We study reward modeling for tool‑calling LLMs and find that general‑purpose reward models, trained largely on free‑text outputs, often miss signals needed to judge tool‑call correctness and execution. To quantify this gap, we introduce FC‑RewardBench, a benchmark derived from BFCL‑v3 with paired correct/incorrect tool‑call sequences that capture subtle failures (e.g., incorrect parameter values, missing/extra calls). We then propose ToolRM, a suite of outcome reward models (1.7B–14B) trained on synthetic preference data from permissively licensed, open‑weight function‑calling models. Across seven out‑of‑domain benchmarks, ToolRM outperforms strong reward models and LLMs‑as‑judges while being more compute‑efficient, and in Best‑of‑n inference yields up to 25% average improvement in tool‑calling accuracy, with especially large gains for smaller generators. ToolRM also enables effective data filtering, producing fine‑tuned models that match or exceed full‑data training using only half the data.
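A minimal sketch of the reward-guided data-filtering idea described above: rank tool-calling training examples by a scalar reward score and keep only the top-scoring fraction before fine-tuning. The field names (`query`, `tool_call`) and the scorer interface are assumptions for illustration, not the paper's exact pipeline.

```python
# Illustrative reward-guided data filtering: keep the highest-scoring half
# of a tool-calling SFT dataset before fine-tuning on it.
from typing import Callable

def filter_by_reward(
    examples: list[dict],
    reward_score: Callable[[str, str], float],
    keep_fraction: float = 0.5,
) -> list[dict]:
    """Rank examples by reward-model score and keep the top `keep_fraction`."""
    ranked = sorted(
        examples,
        key=lambda ex: reward_score(ex["query"], ex["tool_call"]),
        reverse=True,
    )
    return ranked[: int(len(ranked) * keep_fraction)]

# Usage (hypothetical): pass the `score` helper from the Best-of-n sketch above,
# then fine-tune on the filtered subset instead of the full dataset.
# half_data = filter_by_reward(sft_examples, reward_score=score)
```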
The following similar papers were recommended by the Librarian Bot via the Semantic Scholar API:
- Libra: Assessing and Improving Reward Model by Learning to Think (2025)
- Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
- Exploring Superior Function Calls via Reinforcement Learning (2025)
- GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning (2025)
- URPO: A Unified Reward & Policy Optimization Framework for Large Language Models (2025)
- Tool-integrated Reinforcement Learning for Repo Deep Search (2025)