Papers
arxiv:2506.09600

Effective Red-Teaming of Policy-Adherent Agents

Published on Jun 11
· Submitted by itaynakash on Jun 16
#2 Paper of the day
Authors:
,
,
,

Abstract

CRAFT, a multi-agent system using policy-aware persuasive strategies, challenges policy-adherent LLM-based agents in customer service to assess and improve their robustness against adversarial attacks.

AI-generated summary

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks

Community

Paper submitter

This work demonstrates that policy-adherent agents are highly vulnerable to targeted, policy-aware adversarial users. Using CRAFT, a multi-agent red-teaming system, we significantly outperform generic jailbreaks by leveraging the agent’s own policy constraints. We also suggest a method to adapt a standart dataset into adversarial safety benchmarks, enabling scalable, realistic evaluation. Our findings reveal that simple defenses are not enough—robust protection requires new, research-driven approaches.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.09600 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.09600 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.09600 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.