πŸ“Š Performance on BFCL Benchmark

Source: From Tool Use to Agentic Evaluation of Large Language Models (BFCL)

πŸ”Ή Non-Live Evaluation (Overall: 85.25%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 86.46% |
| Simple AST | 72.83% |
| Python Simple | 96.50% |
| Java Simple | 60.00% |
| JavaScript Simple | 62.00% |
| Multiple AST | 92.50% |
| Parallel AST | 92.00% |
| Parallel Multiple AST | 88.50% |
| Irrelevance Detection | 80.42% |

πŸ”Ή Live Evaluation (Overall: 74.46%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 75.87% |
| Python Simple AST | 76.36% |
| Python Multiple AST | 76.26% |
| Python Parallel AST | 56.25% |
| Python Parallel Multiple AST | 66.67% |
| Irrelevance Detection | 72.22% |
| Relevance Detection | 77.78% |

Qwen2.5-14B-Instruct-APIGen-MT-5k

This model is a fine-tuned version of Qwen/Qwen2.5-14B-Instruct, tailored for tool-calling tasks. It was trained on the Salesforce/APIGen-MT-5k dataset to improve its ability to make tool calls based on user instructions.
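
The training data is public, so it can be inspected directly from the Hugging Face Hub. A minimal sketch using the `datasets` library; the `train` split name is an assumption based on typical Hub layouts:

```python
from datasets import load_dataset

# Load the APIGen-MT-5k multi-turn tool-calling dataset
ds = load_dataset("Salesforce/APIGen-MT-5k", split="train")
print(ds)     # row count and column names
print(ds[0])  # one training example
```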

🧠 Model Details

  • Base Model: Qwen/Qwen2.5-14B-Instruct
  • Model Size: 14B parameters
  • Fine-tuning Method: sft with LoRA (Low-Rank Adaptation)

πŸ‹οΈ Training Configuration

| Setting | Value |
|---|---|
| Dataset | Salesforce/APIGen-MT-5k |
| Epochs | 3 |
| Batch Size | 64 |
| Learning Rate | 1e-5 |
| Weight Decay | 2e-6 |
| Scheduler | Cosine |
| LoRA Rank | 16 |
| Quantization | 4-bit during training |
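
The card does not publish the full trainer script, but the table above maps onto a standard QLoRA-style setup. The sketch below shows how these values could be expressed with `peft` and `transformers`; the LoRA alpha, dropout, target modules, and batch-size split are assumptions not stated in the card:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit quantization during training (QLoRA-style), per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumption: NF4 is the common QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter with rank 16, per the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                         # assumption: not stated in the card
    lora_dropout=0.05,                     # assumption: not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

# Optimizer schedule matching the table (effective batch size 64)
training_args = TrainingArguments(
    output_dir="qwen2.5-14b-apigen-mt-5k",
    num_train_epochs=3,
    learning_rate=1e-5,
    weight_decay=2e-6,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,         # assumption: 8 x 8 accumulation = 64 effective
    gradient_accumulation_steps=8,
)
```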

πŸ”§ How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "miazhao/Qwen2.5-14B-Instruct-APIGen-MT-5k"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",     # or .to("cuda") / .to("cpu")
)

# Example conversation with a tool call
messages = [
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "Thanks. I am doing well. How can I help you?"},
    {"role": "user", "content": "What's the weather like in London?"},
]

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit of temperature to return"}
            },
            "required": ["location"]
        }
    }
]

print("====== prompt after applying chat template ======")
print(tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False))

inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
input_ids_len = inputs["input_ids"].shape[-1]
inputs = {k: v.to(model.device) for k, v in inputs.items()}

print("====== model response ======")
outputs = model.generate(**inputs, max_new_tokens=256)
generated_tokens = outputs[:, input_ids_len:]
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```

Expected response:

```
Sure, let me check the current weather in London for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "London"}}
</tool_call>
```
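
The `<tool_call>` block can then be parsed to execute the tool and feed its result back for a final answer. A minimal sketch, assuming the Qwen-style `<tool_call>` format shown above; `run_tool` is a hypothetical dispatcher you would implement yourself:

```python
import json
import re

def extract_tool_calls(text: str):
    """Parse <tool_call>{...}</tool_call> blocks from the model output."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    return [json.loads(m) for m in re.findall(pattern, text, re.DOTALL)]

response = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
messages.append({"role": "assistant", "content": response})

for call in extract_tool_calls(response):
    # call looks like {"name": "get_weather", "arguments": {"location": "London"}}
    result = run_tool(call["name"], call["arguments"])  # hypothetical dispatcher
    messages.append({"role": "tool", "name": call["name"], "content": json.dumps(result)})

# Re-apply the chat template (with tools=...) and generate again to produce the final answer.
```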