An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Abstract
This study identifies testing practices in AI agent frameworks and agentic applications, highlighting a concentration of testing effort on deterministic components and a neglect of the Trigger (prompt) component, and suggests improvements for robustness.
Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to the canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. This also exposes a critical blind spot: the Trigger component (prompts) remains neglected, appearing in only around 1% of all tests. Our findings offer the first empirical testing baseline for FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To close this gap, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
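To make the adapted patterns concrete, below is a minimal sketch of how membership and negative testing are commonly applied to non-deterministic FM output. The `run_agent` function, the prompts, and the pytest skip markers are illustrative assumptions, not code from the paper or from any of the studied frameworks.

```python
# Illustrative sketch only: `run_agent` is a hypothetical entry point into an
# FM-based agent; it is not an API from the paper or a specific framework.
import pytest


def run_agent(prompt: str) -> str:
    """Placeholder for a call into an FM-based agent; assumed to return free-form text."""
    raise NotImplementedError


@pytest.mark.skip(reason="requires a live agent backend")
def test_membership_tolerates_nondeterminism():
    # Membership testing: instead of asserting an exact string (flaky under
    # FM non-determinism), assert that a required token appears in the output.
    answer = run_agent("What is the capital of France? Answer briefly.")
    assert "Paris" in answer


@pytest.mark.skip(reason="requires a live agent backend")
def test_negative_rejects_disallowed_content():
    # Negative testing: assert that forbidden content does NOT appear,
    # regardless of how the FM phrases the rest of the response.
    answer = run_agent("Summarize the document without revealing the API key.")
    assert "sk-" not in answer
```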
Community
TL;DR
We present the first large-scale study of unit testing in the AI agent ecosystem. We identify 10 testing patterns and map them to 13 canonical components, revealing a strong inversion of testing effort toward deterministic infra (tools/workflows) and a critical blind spot: prompts are tested in ~1% of cases.
Why it matters
Benchmarks tell you if an agent can complete tasks. Tests tell you when it breaks—especially under non-determinism, tool errors, and model updates.
Key findings
- Inverted focus: Resource & Coordination Artifacts consume >70% of tests; Plan Body <5%.
- Prompt neglect: Triggers ≈1% of tests → high risk of silent drift when FMs evolve (see the prompt regression sketch after this list).
- Pattern usage: Heavy Parameterized/Membership/Negative testing; DeepEval & hyperparameter control ~1%.
- Canonical lens: We ground tests in 13 components (e.g., Resource/Coordination Artifacts, Plan Body, Trigger, Boundary/Observable/Constitutive Entities).
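As a complement to the findings above, here is a minimal sketch of one form of prompt regression testing: snapshot-testing rendered prompt templates to catch accidental edits. The `render_prompt` helper, file paths, and template variables are hypothetical and not taken from the paper; drift caused by FM updates would additionally require behavioral checks against a live model.

```python
# Illustrative sketch: snapshot-style regression test for prompt templates.
# `render_prompt`, the directories, and the template names are hypothetical.
from pathlib import Path

PROMPT_DIR = Path("prompts")
SNAPSHOT_DIR = Path("tests/prompt_snapshots")


def render_prompt(template: str, **variables: str) -> str:
    """Hypothetical renderer: fill named placeholders in a prompt template."""
    return template.format(**variables)


def test_planner_prompt_has_not_drifted():
    # Regression guard: if someone edits the prompt template (or a shared
    # system preamble), this test fails until the snapshot is reviewed and updated.
    template = (PROMPT_DIR / "planner.txt").read_text()
    rendered = render_prompt(template, task="book a flight", tools="search, pay")
    snapshot = (SNAPSHOT_DIR / "planner.txt").read_text()
    assert rendered == snapshot, "Prompt changed; review and refresh the snapshot."
```

Snapshot tests of rendered prompts are cheap and deterministic; pairing them with periodic behavioral checks (e.g., membership assertions against a pinned model version) would cover the model-drift risk flagged above.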