Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model
Abstract
Large Language Models (LLMs) with API-calling capabilities have enabled the creation of effective Language Agents (LAs), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CALM-IT, we train three models, CALM 8B, CALM 70B, and CALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks.
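To make the dataset design concrete, here is a minimal sketch of what a CALM-IT-style training sample could look like: a multi-turn TOD exchange whose assistant turn interleaves ReAct-style reasoning ("Thought") with a structured API call ("Action") and its result ("Observation"). The field names, function name, and schema here are illustrative assumptions, not the actual CALM-IT format.

```python
import json

# Hypothetical CALM-IT-style sample (illustrative schema, not the real one):
# a multi-turn dialogue whose assistant turn interleaves ReAct reasoning
# with a function call and its observed result.
sample = {
    "dialogue": [
        {
            "role": "user",
            "content": "Find me a cheap Italian restaurant in the centre.",
        },
        {
            "role": "assistant",
            "content": (
                "Thought: The user wants a restaurant; I should query the booking API.\n"
                "Action: find_restaurant(area='centre', food='italian', pricerange='cheap')\n"
                "Observation: [{'name': 'Pizza Hut City Centre', 'phone': '01223323737'}]\n"
                "Response: Pizza Hut City Centre matches your request. "
                "Would you like to book a table?"
            ),
        },
        {"role": "user", "content": "Yes, for 2 people at 7pm."},
    ]
}

# Serialize the sample as one training record (e.g. one JSONL line).
record = json.dumps(sample)
print(len(record) > 0)
```

Training on records like this, rather than on single-turn tool-call pairs, is what would let one model practice both dialogue state tracking (carrying "2 people at 7pm" back into a later API call) and function calling in the same trajectory.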
Community
🚀 Can a Single Model Master Both Multi-turn Conversations and Tool Use?
Introducing CALM, a family of fully open-source Conversational Agentic Language Models (CALM 8B, CALM 70B, and CALM 405B), excelling in both multi-turn dialogue management and function calling.
🦍 CALM 405B is the largest open model on the BFCL V3 Leaderboard, ranking #7 and surpassing many proprietary models.
Leaderboard: https://lnkd.in/dxzassRC
Most models struggle with either long-term conversations and dialogue state tracking (TOD) or function calling (LA). CALM (Conversational Agentic Language Model) bridges this gap! It is trained on CALM-IT, our unified dataset blending multi-turn ReAct-style TOD with complex API use, using the Oumi AI platform in partnership with Oumi and Together AI.
📊 Models: CALM 8B, CALM 70B, and CALM 405B, trained from the Llama model series.
How does the CALM model family perform?
✅ Outperforms GPT-4o and other top domain-specific models on:
📌 MultiWOZ 2.4 (TOD)
📌 BFCL V3 (Function Calling)
📌 API-Bank (Function Calling)
Achieving top zero-shot scores not on just one benchmark, but across all three!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems (2025)
- DeepThink: Aligning Language Models with Domain-Specific User Intents (2025)
- IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems (2025)
- FREYR: A Framework for Recognizing and Executing Your Requests (2025)
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (2025)
- PoAct: Policy and Action Dual-Control Agent for Generalized Applications (2025)
- Self-Training Large Language Models for Tool-Use Without Demonstrations (2025)