ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Abstract
A benchmark and a training framework for reward models in tool-calling scenarios improve downstream tool-use performance and enable data-efficient fine-tuning through outcome-based evaluation.
As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.
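As a rough illustration of how an outcome reward model can be used at inference time, the sketch below performs Best-of-n reranking over candidate tool calls. It assumes a scalar-head (sequence-classification) reward model and a simple prompt format; the checkpoint path, formatting, and scoring interface are illustrative assumptions, not the paper's released setup.

```python
# Minimal Best-of-n reranking sketch with an outcome reward model.
# Checkpoint path and prompt format are placeholders, not ToolRM's exact interface.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "path/to/tool-calling-reward-model"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)
reward_model.eval()

def score(user_query: str, tool_call: str) -> float:
    """Return a scalar reward for a candidate tool call given the user query."""
    text = f"User: {user_query}\nAssistant tool call: {tool_call}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()

def best_of_n(user_query: str, candidates: list[str]) -> str:
    """Pick the candidate tool call with the highest reward score."""
    return max(candidates, key=lambda c: score(user_query, c))

candidates = [
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "kelvin"}}',
]
print(best_of_n("What's the weather in Paris in Celsius?", candidates))
```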
Community
We study reward modeling for tool‑calling LLMs and find that general‑purpose reward models, trained largely on free‑text outputs, often miss signals needed to judge tool‑call correctness and execution. To quantify this gap, we introduce FC‑RewardBench, a benchmark derived from BFCL‑v3 with paired correct/incorrect tool‑call sequences that capture subtle failures (e.g., incorrect parameter values, missing/extra calls). We then propose ToolRM, a suite of outcome reward models (1.7B–14B) trained on synthetic preference data from permissively licensed, open‑weight function‑calling models. Across seven out‑of‑domain benchmarks, ToolRM outperforms strong reward models and LLMs‑as‑judges while being more compute‑efficient, and in Best‑of‑n inference yields up to 25% average improvement in tool‑calling accuracy, with especially large gains for smaller generators. ToolRM also enables effective data filtering, producing fine‑tuned models that match or exceed full‑data training using only half the data.
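A minimal sketch of the reward-guided data-filtering idea described above: rank tool-calling training examples by a scalar reward score and keep only the top-scoring fraction before fine-tuning. The field names (`query`, `tool_call`) and the scorer interface are assumptions for illustration, not the paper's exact pipeline.

```python
# Illustrative reward-guided data filtering: keep the highest-scoring half
# of a tool-calling SFT dataset before fine-tuning on it.
from typing import Callable

def filter_by_reward(
    examples: list[dict],
    reward_score: Callable[[str, str], float],
    keep_fraction: float = 0.5,
) -> list[dict]:
    """Rank examples by reward-model score and keep the top `keep_fraction`."""
    ranked = sorted(
        examples,
        key=lambda ex: reward_score(ex["query"], ex["tool_call"]),
        reverse=True,
    )
    return ranked[: int(len(ranked) * keep_fraction)]

# Usage (hypothetical): pass the `score` helper from the Best-of-n sketch above,
# then fine-tune on the filtered subset instead of the full dataset.
# half_data = filter_by_reward(sft_examples, reward_score=score)
```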
The following similar papers were recommended by the Librarian Bot via the Semantic Scholar API:
- Libra: Assessing and Improving Reward Model by Learning to Think (2025)
- Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
- Exploring Superior Function Calls via Reinforcement Learning (2025)
- GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning (2025)
- URPO: A Unified Reward & Policy Optimization Framework for Large Language Models (2025)
- Tool-integrated Reinforcement Learning for Repo Deep Search (2025)