Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
Abstract
A novel method, Preference Hijacking (Phi), manipulates Multimodal Large Language Model (MLLM) response preferences at inference time using specially crafted images, demonstrating strong effectiveness across various tasks.
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating MLLM response preferences using a preference-hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is available at https://github.com/Yifan-Lan/Phi.
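To make the idea concrete, the sketch below illustrates one plausible way an inference-time preference hijacking image could be optimized: a PGD-style loop that updates image pixels so a frozen MLLM assigns low negative log-likelihood (NLL) to attacker-preferred responses. The function name `hijack_image`, the `target_nll` callable, and all hyperparameters are illustrative assumptions, not the authors' implementation; in practice `target_nll` would be the MLLM's cross-entropy on preference-aligned target responses conditioned on the perturbed image and the user prompt.

```python
# Hypothetical sketch of optimization-based preference hijacking (not the paper's exact code).
import torch

def hijack_image(image, target_nll, steps=500, eps=32 / 255, alpha=1 / 255):
    """PGD-style optimization of an image so a frozen MLLM favors a target preference.

    image      : float tensor in [0, 1], shape (3, H, W)
    target_nll : callable(batch) -> scalar NLL of the attacker-preferred response
    eps        : L-infinity budget (eps=1.0 effectively optimizes the whole image)
    alpha      : step size per iteration
    """
    image = image.clone().detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = target_nll((image + delta).clamp(0, 1).unsqueeze(0))
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()          # descend on the target NLL
            delta.clamp_(-eps, eps)                     # respect the perturbation budget
            delta.add_(image).clamp_(0, 1).sub_(image)  # keep image + delta in [0, 1]
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()


if __name__ == "__main__":
    # Stand-in loss so the sketch runs without loading an MLLM; the real loss would be
    # the model's cross-entropy on preference-aligned responses given image + prompt.
    def dummy_nll(batch):
        return ((batch - 0.5) ** 2).mean()

    hijacked = hijack_image(torch.rand(3, 224, 224), dummy_nll, steps=10)
    print(hijacked.shape)  # torch.Size([3, 224, 224])
```

With a real target, `target_nll` would tokenize the user prompt plus a preferred response, mask the prompt tokens, and return the model's cross-entropy on the response tokens only; setting `eps=1.0` corresponds to optimizing the entire image rather than a small bounded perturbation.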
Community
We uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them especially challenging to detect and defend against. We refer to this class of attacks as preference hijacking (Phi). Furthermore, we introduce the universal hijacking perturbation—a transferable component that can be embedded into diverse images to steer MLLM responses toward attacker-specified preferences. This enhances the scalability and efficiency of the attack, thereby amplifying its overall risk.
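As a companion to the per-image sketch above, the following hypothetical sketch shows how a single universal perturbation might be optimized over a batch of diverse carrier images and then added to an unseen image at inference time. The function names, `target_nll` stand-in, and hyperparameters are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of a universal hijacking perturbation (not the paper's exact code).
import torch

def universal_perturbation(images, target_nll, steps=1000, eps=32 / 255, alpha=1 / 255):
    """Optimize one shared perturbation over a set of diverse carrier images.

    images     : float tensor in [0, 1], shape (N, 3, H, W)
    target_nll : callable(batch) -> scalar NLL of the attacker-preferred responses
    """
    delta = torch.zeros_like(images[0], requires_grad=True)
    for _ in range(steps):
        perturbed = (images + delta).clamp(0, 1)   # broadcast the shared delta over the batch
        loss = target_nll(perturbed)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return delta.detach()


if __name__ == "__main__":
    def dummy_nll(batch):                              # stand-in for the MLLM loss
        return ((batch - 0.5) ** 2).mean()

    delta = universal_perturbation(torch.rand(8, 3, 224, 224), dummy_nll, steps=10)
    carrier = torch.rand(3, 224, 224)                  # an image never seen during optimization
    hijacked = (carrier + delta).clamp(0, 1)           # embed the perturbation at inference time
    print(hijacked.shape)  # torch.Size([3, 224, 224])
```

Sharing one `delta` across many carrier images is what would make the perturbation transferable: the optimization cannot rely on the content of any single image, so the same component can be embedded into new images to steer responses toward the attacker-specified preference.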
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering (2025)
- Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models (2025)
- Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (2025)
- Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models (2025)
- MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? (2025)
- Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models (2025)
- PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance (2025)