## 📊 Performance on BFCL Benchmark

Source: *From Tool Use to Agentic Evaluation of Large Language Models* (BFCL)
### 🔹 Non-Live Evaluation (Overall: 85.25%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 86.46% |
| Simple AST | 72.83% |
| Python Simple | 96.50% |
| Java Simple | 60.00% |
| JavaScript Simple | 62.00% |
| Multiple AST | 92.50% |
| Parallel AST | 92.00% |
| Parallel Multiple AST | 88.50% |
| Irrelevance Detection | 80.42% |
### 🔹 Live Evaluation (Overall: 74.46%)

| Task Category | Accuracy |
|---|---|
| AST Summary | 75.87% |
| Python Simple AST | 76.36% |
| Python Multiple AST | 76.26% |
| Python Parallel AST | 56.25% |
| Python Parallel Multiple AST | 66.67% |
| Irrelevance Detection | 72.22% |
| Relevance Detection | 77.78% |
# Qwen2.5-14B-Instruct-APIGen-MT-5k
This model is a fine-tuned version of Qwen/Qwen2.5-14B-Instruct, tailored for tool-calling tasks. It was trained on the Salesforce/APIGen-MT-5k dataset to improve its ability to make tool calls based on user instructions.
## 🧠 Model Details

- Base Model: Qwen/Qwen2.5-14B-Instruct
- Model Size: 14B parameters
- Fine-tuning Method: Supervised fine-tuning (SFT) with LoRA (Low-Rank Adaptation)
## 🏋️ Training Configuration

| Setting | Value |
|---|---|
| Dataset | Salesforce/APIGen-MT-5k |
| Epochs | 3 |
| Batch Size | 64 |
| Learning Rate | 1e-5 |
| Weight Decay | 2e-6 |
| Scheduler | Cosine |
| LoRA Rank | 16 |
| Quantization | 4-bit during training |
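
The card does not include the training script, but a comparable QLoRA-style SFT run could be set up with 🤗 `peft` and `trl` using the hyperparameters from the table above. This is a minimal sketch, not the authors' exact recipe: the LoRA alpha/dropout, the batch-size split across devices, and the dataset preprocessing are all assumptions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)

# 4-bit quantization during training (QLoRA-style), per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)

peft_config = LoraConfig(
    r=16,             # LoRA rank from the table
    lora_alpha=32,    # assumed; not stated in the card
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="qwen2.5-14b-apigen-mt-5k",
    num_train_epochs=3,
    per_device_train_batch_size=4,    # 4 x 16 accumulation = effective batch 64 (assumed split)
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    weight_decay=2e-6,
    lr_scheduler_type="cosine",
)

# NOTE: APIGen-MT-5k may need mapping into TRL's expected conversational
# format (a `messages` column) before training.
train_dataset = load_dataset("Salesforce/APIGen-MT-5k", split="train")

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```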
## 🔧 How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "miazhao/Qwen2.5-14B-Instruct-APIGen-MT-5k"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",  # or .to("cuda") / .to("cpu")
)

# Example conversation that should trigger a tool call
messages = [
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "Thanks. I am doing well. How can I help you?"},
    {"role": "user", "content": "What's the weather like in London?"},
]

# Tool definition passed to the chat template (parameters follow JSON Schema)
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit of temperature to return"},
            },
            "required": ["location"],
        },
    }
]

print("====== prompt after applying chat template ======")
print(tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False))

inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
input_ids_len = inputs["input_ids"].shape[-1]
inputs = {k: v.to(model.device) for k, v in inputs.items()}

print("====== model response ======")
outputs = model.generate(**inputs, max_new_tokens=256)
generated_tokens = outputs[:, input_ids_len:]  # keep only the newly generated tokens
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```
Expected response:

```
Sure, let me check the current weather in London for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "London"}}
</tool_call>
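```

To act on the model's output programmatically, the `<tool_call>` block can be extracted, parsed as JSON, and the tool's result fed back as a `tool` message for a second generation pass. The sketch below assumes it runs after the snippet above (reusing `tokenizer`, `model`, `messages`, `tools`, and `generated_tokens`); the regex-based parsing and the stubbed weather result are illustrative assumptions, not part of the model's API.

```python
import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract JSON payloads from <tool_call>...</tool_call> blocks."""
    return [
        json.loads(block)
        for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    ]

response_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
messages.append({"role": "assistant", "content": response_text})

for call in parse_tool_calls(response_text):
    # Dispatch to a real implementation here; this stub is illustrative only.
    result = {"location": call["arguments"]["location"], "temperature": "15", "unit": "celsius"}
    messages.append({"role": "tool", "name": call["name"], "content": json.dumps(result)})

# Re-run generation so the model can turn the tool result into a final answer
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```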