## 📊 Performance on BFCL Benchmark

Source: *From Tool Use to Agentic Evaluation of Large Language Models* (BFCL)
### 🔹 Non-Live Evaluation (Overall: 85.25%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 86.46% |
| Simple AST | 72.83% |
| Python Simple | 96.50% |
| Java Simple | 60.00% |
| JavaScript Simple | 62.00% |
| Multiple AST | 92.50% |
| Parallel AST | 92.00% |
| Parallel Multiple AST | 88.50% |
| Irrelevance Detection | 80.42% |
### 🔹 Live Evaluation (Overall: 74.46%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 75.87% |
| Python Simple AST | 76.36% |
| Python Multiple AST | 76.26% |
| Python Parallel AST | 56.25% |
| Python Parallel Multiple AST | 66.67% |
| Irrelevance Detection | 72.22% |
| Relevance Detection | 77.78% |
# Qwen2.5-14B-Instruct-APIGen-MT-5k
This model is a fine-tuned version of Qwen/Qwen2.5-14B-Instruct, tailored for tool-calling tasks. It was trained on the Salesforce/APIGen-MT-5k dataset to improve its ability to make tool calls based on user instructions.
## 🧠 Model Details

- Base Model: Qwen/Qwen2.5-14B-Instruct
- Model Size: 14B parameters
- Fine-tuning Method: Supervised fine-tuning (SFT) with LoRA (Low-Rank Adaptation)
## 🏋️ Training Configuration

| Setting | Value |
|---|---|
| Dataset | Salesforce/APIGen-MT-5k |
| Epochs | 3 |
| Batch Size | 64 |
| Learning Rate | 1e-5 |
| Weight Decay | 2e-6 |
| Scheduler | Cosine |
| LoRA Rank | 16 |
| Quantization | 4-bit during training |
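
The card does not include the training script, but a comparable QLoRA-style SFT run could be set up with 🤗 `peft` and `trl` using the hyperparameters from the table above. This is a minimal sketch, not the authors' exact recipe: the LoRA alpha/dropout, the batch-size split across devices, and the dataset preprocessing are all assumptions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)

# 4-bit quantization during training (QLoRA-style), per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)

peft_config = LoraConfig(
    r=16,             # LoRA rank from the table
    lora_alpha=32,    # assumed; not stated in the card
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="qwen2.5-14b-apigen-mt-5k",
    num_train_epochs=3,
    per_device_train_batch_size=4,    # 4 x 16 accumulation = effective batch 64 (assumed split)
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    weight_decay=2e-6,
    lr_scheduler_type="cosine",
)

# NOTE: APIGen-MT-5k may need mapping into TRL's expected conversational
# format (a `messages` column) before training.
train_dataset = load_dataset("Salesforce/APIGen-MT-5k", split="train")

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```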
## 🔧 How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "miazhao/Qwen2.5-14B-Instruct-APIGen-MT-5k"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",  # or .to("cuda") / .to("cpu")
)

# Example conversation that should trigger a tool call
messages = [
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "Thanks. I am doing well. How can I help you?"},
    {"role": "user", "content": "What's the weather like in London?"},
]

# Tool definition passed to the chat template (parameters follow JSON Schema)
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit of temperature to return"},
            },
            "required": ["location"],
        },
    }
]

print("====== prompt after applying chat template ======")
print(tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False))

inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
input_ids_len = inputs["input_ids"].shape[-1]
inputs = {k: v.to(model.device) for k, v in inputs.items()}

print("====== model response ======")
outputs = model.generate(**inputs, max_new_tokens=256)
generated_tokens = outputs[:, input_ids_len:]  # keep only the newly generated tokens
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```
Expected response:

```
Sure, let me check the current weather in London for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "London"}}
</tool_call>
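```

To act on the model's output programmatically, the `<tool_call>` block can be extracted, parsed as JSON, and the tool's result fed back as a `tool` message for a second generation pass. The sketch below assumes it runs after the snippet above (reusing `tokenizer`, `model`, `messages`, `tools`, and `generated_tokens`); the regex-based parsing and the stubbed weather result are illustrative assumptions, not part of the model's API.

```python
import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract JSON payloads from <tool_call>...</tool_call> blocks."""
    return [
        json.loads(block)
        for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    ]

response_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
messages.append({"role": "assistant", "content": response_text})

for call in parse_tool_calls(response_text):
    # Dispatch to a real implementation here; this stub is illustrative only.
    result = {"location": call["arguments"]["location"], "temperature": "15", "unit": "celsius"}
    messages.append({"role": "tool", "name": call["name"], "content": json.dumps(result)})

# Re-run generation so the model can turn the tool result into a final answer
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```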