---
license: other
tags:
- mcp
- tools quality
- tool quality selection
- tool selection
- TSQ
- TUQ
- sequence-classification
- tool-evaluation
- function-call
- limbic
- tool use
- tool quality
pipeline_tag: text-classification
language:
- en
base_model:
- Qwen/Qwen3-0.6B
---

![](https://pixel.qualifire.ai/api/record/ranger)

## 🧠 Model Description

The **mcp-tool-use-quality-ranger-0.6b** is a fine-tuned sequence classification model for **evaluating the quality of function calls** in conversational AI systems, designed around **Model Context Protocol (MCP) tools**. It assesses whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

**Max Context Length:** **32,768 tokens**

For a given function call, the model determines whether it:

- Selects the correct tool
- Has correct parameter names and structure
- Contains correct parameter values

It produces one of four possible classification labels:

| Label | Meaning |
|-------|---------|
| **VALID_CALL** | ✅ The tool name, parameters, and values are all correct, or no suitable tool exists and no function call is made. |
| **TOOL_ERROR** | ❌ The tool name does not exist or does not match the user intent. |
| **PARAM_NAME_ERROR** | ❌ The correct tool is used, but parameter names are missing, extra, or incorrect. |
| **PARAM_VALUE_ERROR** | ❌ Tool and parameter names are correct, but parameter values are wrong or incorrectly formatted. |

---

## 🔽 Quantized Version

- 🪶 **GGUF**: [qualifire/mcp-tool-use-quality-ranger-0.6b-GGUF](https://huggingface.co/qualifire/mcp-tool-use-quality-ranger-0.6b-GGUF)

---

## 📊 Benchmark Evaluation

The **mcp-tool-use-quality-ranger-0.6b** was evaluated in a binary classification setting, where a prediction counts as **Correct** if the function call evaluation matches the gold label, and **Incorrect** otherwise.

| Model | #Params | Avg. Latency (sec) | Avg. Binary Accuracy | [Qualifire mcp-tool-use-quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-use-quality-benchmark) Binary Accuracy | [Limbic Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp) Binary Accuracy |
|---|---|---|---|---|---|
| qualifire/mcp-tool-use-quality-ranger-4b [private] | 4B | 0.30 | 0.962 | 0.971 | 0.954 |
| **qualifire/mcp-tool-use-quality-ranger-0.6b** | **0.6B** | **0.09** | **0.928** | **0.949** | **0.907** |
| gemini-2.5-flash | - | 4.87 | 0.858 | 0.871 | 0.845 |
| quotientai/limbic-tool-use-0.5B-32K | 0.5B | 0.79 | 0.798 | 0.708 | 0.887 |

### 📌 Metrics Definitions

- **Avg. Binary Accuracy** – Mean accuracy across all evaluated benchmarks, where predictions are mapped to binary outcomes as follows (see the sketch after this list):
  - **Qualifire TUQ Benchmark**
    - **Correct** → `VALID_CALL`
    - **Incorrect** → `TOOL_ERROR`, `PARAM_NAME_ERROR`, or `PARAM_VALUE_ERROR`
  - **Limbic Benchmark**
    - **Correct** → `correct`
    - **Incorrect** → `incorrect_tool`, `incorrect_parameter_names`, or `incorrect_parameter_values`
- **Qualifire TUQ Benchmark** link – [Qualifire Tool Selection Quality Benchmark](https://huggingface.co/datasets/qualifire/mcp-tool-selection-quality-benchmark).
- **Limbic Benchmark** link – [Limbic Eval Tool Use MCP Benchmark](https://huggingface.co/datasets/quotientai/limbic-eval-tool-use-mcp).
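For illustration, a minimal sketch of this label-to-binary mapping is shown below. The label strings come from the tables above; the helper function and its name are illustrative and not part of the benchmark tooling.

```python
# Illustrative sketch of the binary mapping described above (helper name is hypothetical).
QUALIFIRE_CORRECT = {"VALID_CALL"}
LIMBIC_CORRECT = {"correct"}

def to_binary(label: str, benchmark: str = "qualifire") -> bool:
    """Map a predicted label to the Correct/Incorrect outcome used for scoring."""
    correct_labels = QUALIFIRE_CORRECT if benchmark == "qualifire" else LIMBIC_CORRECT
    return label in correct_labels

assert to_binary("VALID_CALL") is True
assert to_binary("PARAM_NAME_ERROR") is False
assert to_binary("incorrect_tool", benchmark="limbic") is False
```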
---

## 📜 Evaluation Prompt Template

The model uses the following structured evaluation process:

1. **TOOL SELECTION**
   - Check that the tool name exists in `available_tools`
   - Check that the tool's purpose matches the user intent
   - Fail → `TOOL_ERROR` ❌
2. **PARAMETER STRUCTURE**
   - All required parameters are present
   - No extra parameters
   - Parameter names exactly match the schema
   - Fail → `PARAM_NAME_ERROR` ❌
3. **PARAMETER VALUES**
   - Values have correct data types
   - Values match the user request
   - No fabricated or incorrect values
   - Fail → `PARAM_VALUE_ERROR` ❌

If all checks pass → `VALID_CALL` ✅

---

### 📦 Requirements

- `transformers>=4.51.0`
- `huggingface_hub`
- `torch`

---

## 💻 Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
from huggingface_hub import hf_hub_download

# Model name
model_name = "qualifire/mcp-tool-use-quality-ranger-0.6b"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Load the prompt template shipped with the model repository
file_path = hf_hub_download(repo_id=model_name, filename="tsq_prompt_template.txt")
with open(file_path, encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()

# Example inputs: the available tools (with their JSON schemas) and the conversation to evaluate
example_tools_list = '''[
  {
    "name": "order_food",
    "description": "Order food from a restaurant.\nArgs:\nrestaurant_url: URL of the restaurant\nitem_name: Name of the item to order",
    "inputSchema": {
      "type": "object",
      "title": "order_foodArguments",
      "required": ["item_url", "item_name"],
      "properties": {
        "item_url": {
          "type": "string",
          "title": "Item Url"
        },
        "item_name": {
          "type": "string",
          "title": "Item Name"
        }
      }
    }
  }
]'''

example_message_history = '''[
  {
    "role": "user",
    "content": "Could you please order 2 Margherita pizzas for delivery to 123 Main Street, Anytown?"
  },
  {
    "completion_message": {
      "content": {
        "type": "text",
        "text": ""
      },
      "role": "assistant",
      "stop_reason": "tool_calls",
      "tool_calls": [
        {
          "id": "call_p8yj1p",
          "function": {
            "name": "order_food",
            "arguments": {
              "item": "Margherita Pizza",
              "quantity": 3,
              "delivery_address": "123 Main Street, Anytown"
            }
          }
        }
      ]
    }
  }
]'''

# Format the input with the prompt template
example_input = PROMPT_TEMPLATE.format(
    message_history=example_message_history,
    available_tools=example_tools_list
)

# Get prediction
result = pipe(example_input)[0]
print(result)
```

## ✨ Example Output

```
{'label': 'PARAM_VALUE_ERROR', 'score': 0.8680815696716309}
```

The user asked for 2 pizzas, but the call sets `quantity` to 3, so the correct label is `PARAM_VALUE_ERROR`.
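As a follow-up to the usage example, a hypothetical convenience wrapper is sketched below. It assumes `pipe`, `PROMPT_TEMPLATE`, and the example inputs are already defined as in the snippet above; the function name and the added `is_valid` field are illustrative, not part of the model's API.

```python
# Hypothetical wrapper around the usage example above.
# Assumes `pipe` and `PROMPT_TEMPLATE` are already defined as shown there.
def evaluate_tool_call(message_history: str, available_tools: str) -> dict:
    """Classify a tool call and flag whether it would count as a valid call."""
    prompt = PROMPT_TEMPLATE.format(
        message_history=message_history,
        available_tools=available_tools,
    )
    result = pipe(prompt)[0]
    result["is_valid"] = result["label"] == "VALID_CALL"
    return result

print(evaluate_tool_call(example_message_history, example_tools_list))
# e.g. {'label': 'PARAM_VALUE_ERROR', 'score': 0.868..., 'is_valid': False}
```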