Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Abstract
A hybrid framework combining LLM annotation with human oversight is proposed to clean red-teaming datasets and detect jailbroken responses, along with two new developer-message-based strategies that enhance jailbreak success.
Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which exhibit inconsistent accuracy across types of harmful content. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to red-teaming dataset cleaning and to detecting jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgments, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
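The MDH workflow described above can be pictured as a simple triage loop: an LLM judge labels each prompt, and only low-confidence cases are escalated to a human reviewer. Below is a minimal, illustrative sketch assuming the OpenAI Python SDK; the judge model (`gpt-4o-mini`), the judge prompt wording, and the 0.9 confidence threshold are placeholders of our own, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative MDH-style triage loop: LLM annotation plus minimal human oversight.
# Model name, prompt wording, and threshold are assumptions, not the paper's code.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are a content auditor. Decide whether the following prompt is "
    "explicitly harmful (i.e., it clearly requests disallowed content). "
    'Answer with a JSON object: {"harmful": true|false, "confidence": 0-1}.'
)

def llm_annotate(prompt_text: str) -> dict:
    """Ask an LLM judge to label one prompt; returns {'harmful': bool, 'confidence': float}."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": prompt_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def mdh_clean(prompts: list[str], conf_threshold: float = 0.9):
    """Keep prompts the judge confidently labels harmful; route uncertain ones to humans."""
    kept, needs_human_review = [], []
    for p in prompts:
        label = llm_annotate(p)
        if label["confidence"] >= conf_threshold:
            if label["harmful"]:
                kept.append(p)
        else:
            needs_human_review.append(p)  # minimal human oversight step
    return kept, needs_human_review
```

The design point is the hand-off: high-confidence LLM judgments are accepted automatically, so human effort is spent only on the ambiguous cases where LLM accuracy is inconsistent.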
Community
We propose two text-based jailbreak attacks against commercial black-box LLMs and a malicious content detection method (MDH), and apply the latter to red-teaming dataset cleaning and jailbroken-response detection.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Mitigating Jailbreaks with Intent-Aware LLMs (2025)
- Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs (2025)
- Enhancing Jailbreak Attacks on LLMs via Persona Prompts (2025)
- Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models (2025)
- Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers (2025)
- UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases (2025)
- Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models (2025)