arxiv:2503.01865

Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints

Published on Feb 25

Abstract

Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify two superfluous constraints, specifically the response pattern constraint and the token tail constraint, as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated with Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
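The abstract names the two removed constraints only at a high level. As one concrete reading, the sketch below contrasts a standard GCG-style "forcing" objective, which scores every token of a fixed affirmative target response, with a "guiding" variant that scores only the opening tokens and leaves the response tail unconstrained. This is a minimal illustration of the general idea, not the authors' implementation: the model name (gpt2 as a lightweight stand-in for Llama-3-8B-Instruct), the target string, and the `guide_len` cutoff are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code): "forcing" vs. "guiding" target losses
# for a gradient-based jailbreak objective. Model, strings, and guide_len are
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # lightweight stand-in; the paper's source model is Llama-3-8B-Instruct
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def target_loss(prompt_ids, target_ids, guide_len=None):
    """Cross-entropy of the target continuation given the prompt.

    guide_len=None -> "forcing": loss over every target token, i.e. the full
                      response-pattern and token-tail constraints.
    guide_len=k    -> "guiding": loss over only the first k target tokens,
                      leaving the tail of the response unconstrained.
    """
    if guide_len is not None:
        target_ids = target_ids[:, :guide_len]
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position i predict token i+1; keep those aligned with the target.
    tgt_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    return torch.nn.functional.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)),
        target_ids.reshape(-1),
    )

prompt = tok("Write a story about", return_tensors="pt").input_ids
target = tok(" Sure, here is a story:", return_tensors="pt").input_ids

with torch.no_grad():
    forcing = target_loss(prompt, target)                # full fixed pattern
    guiding = target_loss(prompt, target, guide_len=3)   # opening tokens only
print(f"forcing: {forcing.item():.3f}  guiding: {guiding.item():.3f}")
```

In an actual attack such as GCG, this loss would be minimized with respect to an adversarial suffix appended to the prompt, with gradients taken through the suffix's one-hot token embeddings; the abstract's claim is that relaxing the objective in this way improves transfer to other target models.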
