AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning
Abstract
AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework, enhances safety and utility in LLM-based multi-agent systems by internally optimizing task agents against evolving attacks without additional overhead.
LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single-point-of-failure-once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps attack-success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving-and sometimes improving-task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
Community
LLM-based multi-agent systems are powerful but vulnerable to jailbreaks and prompt injection. We propose AdvEvo-MARL, which internalizes safety by co-training evolving attackers with task agents (defenders) in adversarial settings, plus a public baseline to reduce variance and improve coordination. Across attack scenarios, AdvEvo-MARL keeps ASR <20% (baselines up to 38.33%) while preserving or slightly improving task accuracy (+3.67%), achieving safety–utility gains without extra guard agents or added system overhead.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning (2025)
- Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks (2025)
- Interactive Learning for LLM Reasoning (2025)
- bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs (2025)
- AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework (2025)
- SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models (2025)
- InvThink: Towards AI Safety via Inverse Reasoning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper