Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
Abstract
Chat2Workflow presents a benchmark and agentic framework for automating executable visual workflow generation from natural language, revealing significant challenges in achieving industrial-grade automation despite advances in language models.
Executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. In current practice, however, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, which makes development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework improves the resolve rate by up to 5.34%, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
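To make the idea of checking a generated workflow and retrying on errors more concrete, here is a minimal Python sketch. It is not the authors' framework. The workflow format (a dict with "nodes" and "edges"), the acyclic-graph assumption, and the names validate_workflow and generate_with_repair are all hypothetical and chosen only for illustration. Real platforms such as Dify and Coze use their own formats and may allow constructs this check would reject.

```python
from typing import Callable

# Hypothetical workflow format (an assumption, not the benchmark's schema):
# {"nodes": [{"id": ...}, ...], "edges": [[src_id, dst_id], ...]}
Workflow = dict


def validate_workflow(wf: Workflow) -> list[str]:
    """Return structural errors; an empty list means the graph is well-formed."""
    errors: list[str] = []
    ids = [n["id"] for n in wf.get("nodes", [])]
    if len(ids) != len(set(ids)):
        errors.append("duplicate node ids")
    known = set(ids)
    succ = {i: [] for i in known}
    indeg = {i: 0 for i in known}
    for src, dst in wf.get("edges", []):
        if src not in known or dst not in known:
            errors.append(f"edge {src}->{dst} references an unknown node")
            continue
        succ[src].append(dst)
        indeg[dst] += 1
    # Kahn's algorithm: if some node can never be ordered, there is a cycle.
    ready = [i for i in known if indeg[i] == 0]
    ordered = 0
    while ready:
        node = ready.pop()
        ordered += 1
        for nxt in succ[node]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    if ordered != len(known):
        errors.append("workflow contains a cycle")
    return errors


def generate_with_repair(
    request: str,
    generate: Callable[[str, list[str]], Workflow],
    max_rounds: int = 3,
) -> tuple[Workflow, list[str]]:
    """Call `generate` until validation passes or the round budget runs out.

    `generate` stands in for a model call; it receives the request and the
    errors found in the previous attempt.
    """
    feedback: list[str] = []
    wf: Workflow = {}
    for _ in range(max_rounds):
        wf = generate(request, feedback)
        feedback = validate_workflow(wf)
        if not feedback:
            break
    return wf, feedback
```

In this sketch the validator only checks graph structure. An actual executability check would also need to run the workflow on the target platform, which is outside what this example covers.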
Community
Chat2Workflow is a benchmark and agent framework that aims to automatically generate deployable visual workflows from natural language, but current LLMs still struggle to produce correct and stable executable pipelines in real-world scenarios.
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/chat2workflow-a-benchmark-for-generating-executable-visual-workflows-with-natural-language-1532-79bdf282
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows (2026)
- FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications (2026)
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification (2026)
- kRAIG: A Natural Language-Driven Agent for Automated DataOps Pipeline Generation (2026)
- MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks (2026)
- Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution (2026)
- Production-Grade AI Coding System for Client-Side Development (2026)
Get this paper in your agent:
hf papers read 2604.19667
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash