---
license: other
tags:
  - mcp
  - tools quality
  - tool quality selection
  - tool selection
  - TSQ
  - TUQ
  - sequence-classification
  - tool-evaluation
  - function-call
  - limbic
  - tool use
  - tool quality
pipeline_tag: text-classification
language:
  - en
base_model:
  - Qwen/Qwen3-0.6B
---

## 🧠 Model Description

The mcp-tool-use-quality-ranger-0.6b is a fine-tuned sequence-classification model created to evaluate the quality of function calls in conversational AI systems. It is designed for evaluating function calls in the context of Model Context Protocol (MCP) tools: it can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

**Max context length:** 32,768 tokens

It determines whether a given function call:

- Selects the correct tool
- Has correct parameter names and structure
- Contains correct parameter values
It produces one of four possible classification labels:

| Label | Meaning |
|---|---|
| `VALID_CALL` ✅ | The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| `TOOL_ERROR` ❌ | The tool name does not exist or does not match the user intent. |
| `PARAM_NAME_ERROR` ❌ | The correct tool is used, but parameter names are missing, extra, or incorrect. |
| `PARAM_VALUE_ERROR` ❌ | Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |
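
As an illustration of how a caller might consume these labels downstream, the sketch below gates tool execution on the predicted label and confidence score. The `should_execute` helper and the 0.5 threshold are assumptions for illustration, not part of the model's API; `prediction` uses the `{'label': ..., 'score': ...}` format returned by the pipeline in the Usage section.

```python
# Illustrative gating sketch (hypothetical helper, not part of this model card's code).
def should_execute(prediction: dict, min_score: float = 0.5) -> bool:
    """Execute the tool call only when the ranker labels it VALID_CALL with enough confidence."""
    return prediction["label"] == "VALID_CALL" and prediction["score"] >= min_score

print(should_execute({"label": "VALID_CALL", "score": 0.91}))         # True  -> run the tool
print(should_execute({"label": "PARAM_VALUE_ERROR", "score": 0.87}))  # False -> repair or retry the call
```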

## 🔽 Quantized Version


## 📊 Benchmark Evaluation

The mcp-tool-use-quality-ranger-0.6b was evaluated in a binary classification setting, where a prediction is counted as Correct if the function-call evaluation matches the gold label, and Incorrect otherwise.

| Model | #Params | Avg. Latency | Avg. Binary Accuracy | Qualifire mcp-tool-use-quality Benchmark Binary Accuracy | Limbic Benchmark Binary Accuracy |
|---|---|---|---|---|---|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 sec | 0.962 | 0.971 | 0.954 |
| qualifire/mcp-tool-use-quality-ranger-0.6b | 0.6B | 0.09 sec | 0.928 | 0.949 | 0.907 |
| gemini-2.5-flash | - | 4.87 sec | 0.858 | 0.871 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 sec | 0.798 | 0.708 | 0.887 |

## 📌 Metrics Definitions

- **Avg. Binary Accuracy** – Mean accuracy across all evaluated benchmarks, where predictions are mapped to binary outcomes as follows (see the sketch after this list):
  - Qualifire TUQ Benchmark
    - Correct → VALID_CALL
    - Incorrect → TOOL_ERROR, PARAM_NAME_ERROR, or PARAM_VALUE_ERROR
  - Limbic Benchmark
    - Correct → correct
    - Incorrect → incorrect_tool, incorrect_parameter_names, or incorrect_parameter_values
- **Qualifire TUQ Benchmark** – Qualifire Tool Selection Quality Benchmark.
- **Limbic Benchmark** – Limbic Eval Tool Use MCP Benchmark.
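
For clarity, here is a minimal sketch of the Qualifire TUQ mapping and the resulting binary accuracy. The helper names are illustrative and not part of the released evaluation code:

```python
# Collapse the four TUQ labels into the binary outcome used for accuracy.
def to_binary(label: str) -> str:
    return "Correct" if label == "VALID_CALL" else "Incorrect"

def binary_accuracy(predicted_labels, gold_outcomes):
    """Fraction of examples whose mapped prediction matches the gold binary outcome."""
    matches = [to_binary(p) == g for p, g in zip(predicted_labels, gold_outcomes)]
    return sum(matches) / len(matches)

print(binary_accuracy(
    ["VALID_CALL", "PARAM_VALUE_ERROR", "TOOL_ERROR"],
    ["Correct", "Correct", "Incorrect"],
))  # 0.667 (2 of 3 predictions match the gold outcome)
```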


## 📜 Evaluation Prompt Template

The model uses the following structured evaluation process:

1. **TOOL SELECTION**
   - Check if the tool name exists in `available_tools`
   - Check if the tool purpose matches the user intent
   - Fail → TOOL_ERROR ❌
2. **PARAMETER STRUCTURE**
   - All required parameters are present
   - No extra parameters
   - Parameter names exactly match the schema
   - Fail → PARAM_NAME_ERROR ❌
3. **PARAMETER VALUES**
   - Values have correct data types
   - Values match the user request
   - No fabricated or incorrect values
   - Fail → PARAM_VALUE_ERROR ❌

If all checks pass → VALID_CALL ✅
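
To make the check order concrete, here is a minimal rule-based sketch of steps 1–3. It is purely illustrative: the model itself is a learned classifier, intent and value matching against the conversation are omitted, and `validate_call` and its inputs are hypothetical names.

```python
# Rule-based sketch of the documented decision order (illustrative only, not the model).
def validate_call(call: dict, available_tools: dict) -> str:
    # 1. TOOL SELECTION: the named tool must exist (intent matching is omitted here).
    tool = available_tools.get(call["name"])
    if tool is None:
        return "TOOL_ERROR"

    # 2. PARAMETER STRUCTURE: names must match the schema exactly, no missing or extra ones.
    schema = tool["inputSchema"]
    required = set(schema.get("required", []))
    allowed = set(schema.get("properties", {}))
    provided = set(call["arguments"])
    if not required.issubset(provided) or not provided.issubset(allowed):
        return "PARAM_NAME_ERROR"

    # 3. PARAMETER VALUES: types must match the schema (matching values against the user
    #    request would additionally need the message history, which this sketch ignores).
    type_map = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    for name, value in call["arguments"].items():
        expected = type_map.get(schema["properties"][name].get("type"))
        if expected is not None and not isinstance(value, expected):
            return "PARAM_VALUE_ERROR"

    return "VALID_CALL"

# Toy usage with a made-up tool schema:
tools = {"get_weather": {"inputSchema": {"type": "object",
                                         "required": ["city"],
                                         "properties": {"city": {"type": "string"}}}}}
print(validate_call({"name": "get_weather", "arguments": {"city": "Paris"}}, tools))  # VALID_CALL
print(validate_call({"name": "get_weather", "arguments": {"town": "Paris"}}, tools))  # PARAM_NAME_ERROR
```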


## 📦 Requirements

- `transformers>=4.51.0`
- `huggingface_hub`
- `torch`

## 💻 Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download

# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Load prompt template
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()

# Example inputs
example_tools_list = '''[
  {
    "name": "order_food",
    "description": "Order food from a restaurant.\nArgs:\nrestaurant_url: URL of the restaurant\nitem_name: Name of the item to order",
    "inputSchema": {
      "type": "object",
      "title": "order_foodArguments",
      "required": ["item_url", "item_name"],
      "properties": {
        "item_url": {
          "type": "string",
          "title": "Item Url"
        },
        "item_name": {
          "type": "string",
          "title": "Item Name"
        }
      }
    }
  }
]'''


example_message_history = '''[
  {
    "role": "user",
    "content": "Could you please order 2 Margherita pizzas for delivery to 123 Main Street, Anytown?"
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_p8yj1p",
          "function": {
            "name": "order_food",
            "arguments": {
              "item": "Margherita Pizza",
              "quantity": 3, 
              "delivery_address": "123 Main Street, Anytown"
            }
          }
        }
      ]
    }
  }
]'''

# Format input
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list
)

# Get prediction
result = pipe(example_input)[0]
print(result)
```

## ✨ Example Output

```python
{'label': 'PARAM_VALUE_ERROR', 'score': 0.8680815696716309}
```

The value for `quantity` should be 2, not 3, so the correct label is `PARAM_VALUE_ERROR`.
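
If you need the full score distribution over all four labels (for example, to apply a custom confidence threshold), the pipeline's standard `top_k=None` option returns every class score. A small follow-up sketch, assuming the `pipe` and `example_input` objects from the Usage section:

```python
# Scores for all four labels instead of only the top prediction.
all_scores = pipe(example_input, top_k=None)
# Depending on the transformers version, a single string input may come back nested; flatten if so.
if all_scores and isinstance(all_scores[0], list):
    all_scores = all_scores[0]

for item in sorted(all_scores, key=lambda x: x["score"], reverse=True):
    print(f"{item['label']}: {item['score']:.3f}")
```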