Limbic-Tool-Use MCP Function Call Evaluator

This model is a fine-tuned version of Qwen2.5-0.5B-Instruct specifically designed for evaluating function calls in the context of Model Context Protocol (MCP) tools. It can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

Model Details

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Task: Function Call Evaluation for MCP (Model Context Protocol)
  • Training Data: MCP Server Tools data from public MCP servers, with augmentation / synthetic data generation
  • Model Size: ~40MB (LoRA adapters only)
  • Context Length: 32,768 tokens

Model Usage

Model Prompts

The prompt for the model takes two inputs:

  • available_tools - a list of the tool schemas
  • message_history - the user request and the model's tool-call response, as a list of JSON objects

EVALUATOR_PROMPT = """\
# TOOL CALL EVALUATION RUBRIC

## EVALUATION CRITERIA

### 1. TOOL SELECTION
- [ ] Function name exists in available tools
- [ ] Function purpose matches user intent

### 2. PARAMETER STRUCTURE  
- [ ] All required and relevant parameters are present
- [ ] No hallucinated parameter names
- [ ] Parameter names match tool schema exactly

### 3. PARAMETER VALUES
- [ ] Data types match expected types
- [ ] Values align with user request
- [ ] No fabricated or incorrect values

## CLASSIFICATION RULES
- All criteria passed → `correct`
- Failed criteria 1 → `incorrect_tool`
- Failed criteria 2 → `incorrect_parameter_names`  
- Failed criteria 3 → `incorrect_parameter_values`

---
### AVAILABLE TOOLS
{available_tools}

---
### MESSAGE HISTORY
{message_history}

---
## OUTPUT REQUIREMENT
{{
    "score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
    "reason": < [if incorrect, provide a brief list of reasons] >
}}

### EVALUATION:
"""
SYSTEM_PROMPT = "You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score."

Example Inputs

available_tools = [
    {
        "name": "google-play-developer",
        "description": "Get apps by a developer on Google Play",
        "input_schema": {
            "type": "object",
            "properties": {
                "devId": {"type": "string", "description": "Developer ID"},
                "num": {"type": "number", "default": 60, "description": "Number of results"},
                "lang": {"type": "string", "default": "en", "description": "Language code"},
                "country": {"type": "string", "default": "us", "description": "Country code"}
            },
            "required": ["devId"]
        }
    }
]

message_history = [
    {"role": "user", "content": "I'm looking to evaluate the performance of all the apps developed by 'Example Developer' on the Google Play Store. Could you provide me with a list of their recent applications, specifically in English and focused on the US market? Please limit the results to 50 apps for a quicker review."},
    {"role": "assistant", "content": {"function": "name": "google-play-developer", "arguments": {"devId": "com.example.developer", "num": 50, "lang": "en", "country": "us"}}}
]
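
With these example objects defined, the user prompt is built by filling the template's two placeholders. A minimal sketch (the double braces in the template's output section are already escaped for str.format):

import json

# Fill the template; json.dumps keeps the tool schemas and messages as valid JSON text
user_prompt = EVALUATOR_PROMPT.format(
    available_tools=json.dumps(available_tools, indent=2),
    message_history=json.dumps(message_history, indent=2),
)

Whether the model was trained on indented or compact JSON is not documented here; compact dumps (no indent) also leave more of the 32K context for long tool lists.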

Output Format

The model outputs evaluations in JSON format:

{
    "score": "correct|incorrect_tool|incorrect_parameter_names|incorrect_parameter_values",
    "reason": ["reasons for failure if incorrect"]
}

Score Categories

  • correct: Function call matches available tools and parameters exactly
  • incorrect_tool: Function name doesn't exist in available tools
  • incorrect_parameter_names: Function exists but parameter names are wrong
  • incorrect_parameter_values: Function and parameters exist but values are inappropriate
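
Since the completion is plain text, a small amount of post-processing recovers the structured verdict. A minimal parsing sketch (the parse_evaluation helper is hypothetical, not part of the model card; it assumes the completion contains one JSON object as described above):

import json

VALID_SCORES = {"correct", "incorrect_tool", "incorrect_parameter_names", "incorrect_parameter_values"}

def parse_evaluation(generated_text: str) -> dict:
    # Pull out the first {...} span in case the model emits surrounding text
    start = generated_text.find("{")
    end = generated_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    result = json.loads(generated_text[start:end + 1])
    if result.get("score") not in VALID_SCORES:
        raise ValueError(f"unexpected score: {result.get('score')!r}")
    return result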

Load the Model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "quotientai/limbic-tool-use-0.5B-32K",
    torch_dtype=torch.bfloat16,
).to("cuda")  # move to GPU to match the generation example below

Generate a Prediction

To make a prediction, format the evaluator prompt with your inputs and wrap it in the model's chat template.

messages = [
  {"role": "system", "content": SYSTEM_PROMPT},
  {"role": "user", "content": user_prompt},  # the formatted EVALUATOR_PROMPT from above
]
# Apply the chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenize with truncation and move to the same device as the model
inputs = tokenizer(text, return_tensors="pt", truncation=True).to("cuda")

# Generate the evaluation
output_ids = model.generate(**inputs, max_new_tokens=128, use_cache=True)

# Decode only the newly generated tokens
evaluation = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
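
The decoded evaluation string can then be handed to the parse_evaluation sketch above to obtain the score and reasons as a Python dict.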

Citation

@misc{limbic-tool-use-0.5B-32K,
  title={Limbic Tool Use Evaluator},
  author={QuotientAI},
  year={2025},
  url={https://huggingface.co/quotientai/limbic-tool-use-0.5B-32K}
}