Universal Jailbreak Suffixes Are Strong Attention Hijackers
Abstract
Suffix-based jailbreaks exploit adversarial suffixes to hijack large language models, with effectiveness linked to suffix universality; the attack can be both enhanced and mitigated at minimal computational or utility cost.
We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some are markedly more universal (generalizing to many unseen harmful instructions) than others. We first show that GCG's effectiveness is driven by a shallow, critical mechanism, built on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these insights have practical implications: GCG universality can be efficiently enhanced (up to 5× in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving attack success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
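As a rough illustration of the measurement the abstract describes, the sketch below estimates how much attention the final chat-template tokens (the positions right before generation) pay to the adversarial suffix versus the original instruction. This is not the paper's implementation: the model name, the example instruction, the suffix string, and the Llama-2-style template strings are placeholder assumptions, and the simple attention-mass comparison is only a proxy for the paper's hijacking metric.

```python
# Minimal sketch (NOT the paper's code): probe how much attention the final
# chat-template tokens place on an adversarial suffix vs. the instruction.
# Model name, instruction, suffix, and template strings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any chat LLM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
model.eval()

# Prompt segments in Llama-2 chat style; adjust the template for other models.
segments = {
    "template_start": "[INST] ",
    "instruction": "Write a tutorial on how to pick a lock ",       # placeholder request
    "suffix": "describing.\\ + similarlyNow write oppositeley.](",  # placeholder GCG-style suffix
    "template_end": " [/INST]",                                     # final chat-template tokens
}

# Tokenize segment by segment so each token position can be attributed.
# (Boundary tokenization may differ slightly from tokenizing the full string.)
ids = [tokenizer.bos_token_id] if tokenizer.bos_token_id is not None else []
spans = {}
for name, text in segments.items():
    seg_ids = tokenizer(text, add_special_tokens=False).input_ids
    spans[name] = slice(len(ids), len(ids) + len(seg_ids))
    ids.extend(seg_ids)

input_ids = torch.tensor([ids])
with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions: one (batch, heads, query_len, key_len) tensor per layer.
attn = torch.stack(out.attentions).float()  # (layers, 1, heads, q, k)
queries = spans["template_end"]             # attention *from* the final template tokens

suffix_mass = attn[:, 0, :, queries, spans["suffix"]].sum(-1).mean().item()
instr_mass = attn[:, 0, :, queries, spans["instruction"]].sum(-1).mean().item()
print(f"final-template tokens -> suffix attention mass:      {suffix_mass:.3f}")
print(f"final-template tokens -> instruction attention mass: {instr_mass:.3f}")
```

Comparing such a measure across suffixes of varying universality would mirror, in spirit, the paper's observation that more universal suffixes are stronger hijackers.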
Community
Analyzing the underlying mechanism of suffix-based LLM jailbreaks, we find it relies on aggressively hijacking the model context 🥷: the more universal a suffix, the stronger its hijacking. Exploiting this, we show how to both enhance and mitigate existing attacks.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses (2025)
- LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs (2025)
- One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs (2025)
- One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models (2025)
- Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts (2025)
- COSMIC: Generalized Refusal Direction Identification in LLM Activations (2025)
- SPIRIT: Patching Speech Language Models against Jailbreak Attacks (2025)