Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
Abstract
A novel method, Preference Hijacking (Phi), manipulates Multimodal Large Language Model (MLLM) response preferences at inference time using specially crafted images, demonstrating strong effectiveness across various tasks.
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating MLLM response preferences using a preference-hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is available at https://github.com/Yifan-Lan/Phi.
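To make the idea concrete, the sketch below illustrates one plausible way an inference-time preference hijacking image could be optimized: a PGD-style loop that updates image pixels so a frozen MLLM assigns low negative log-likelihood (NLL) to attacker-preferred responses. The function name `hijack_image`, the `target_nll` callable, and all hyperparameters are illustrative assumptions, not the authors' implementation; in practice `target_nll` would be the MLLM's cross-entropy on preference-aligned target responses conditioned on the perturbed image and the user prompt.

```python
# Hypothetical sketch of optimization-based preference hijacking (not the paper's exact code).
import torch

def hijack_image(image, target_nll, steps=500, eps=32 / 255, alpha=1 / 255):
    """PGD-style optimization of an image so a frozen MLLM favors a target preference.

    image      : float tensor in [0, 1], shape (3, H, W)
    target_nll : callable(batch) -> scalar NLL of the attacker-preferred response
    eps        : L-infinity budget (eps=1.0 effectively optimizes the whole image)
    alpha      : step size per iteration
    """
    image = image.clone().detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = target_nll((image + delta).clamp(0, 1).unsqueeze(0))
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()          # descend on the target NLL
            delta.clamp_(-eps, eps)                     # respect the perturbation budget
            delta.add_(image).clamp_(0, 1).sub_(image)  # keep image + delta in [0, 1]
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()


if __name__ == "__main__":
    # Stand-in loss so the sketch runs without loading an MLLM; the real loss would be
    # the model's cross-entropy on preference-aligned responses given image + prompt.
    def dummy_nll(batch):
        return ((batch - 0.5) ** 2).mean()

    hijacked = hijack_image(torch.rand(3, 224, 224), dummy_nll, steps=10)
    print(hijacked.shape)  # torch.Size([3, 224, 224])
```

With a real target, `target_nll` would tokenize the user prompt plus a preferred response, mask the prompt tokens, and return the model's cross-entropy on the response tokens only; setting `eps=1.0` corresponds to optimizing the entire image rather than a small bounded perturbation.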
Community
We uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them especially challenging to detect and defend against. We refer to this class of attacks as preference hijacking (Phi). Furthermore, we introduce the universal hijacking perturbation—a transferable component that can be embedded into diverse images to steer MLLM responses toward attacker-specified preferences. This enhances the scalability and efficiency of the attack, thereby amplifying its overall risk.
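As a companion to the per-image sketch above, the following hypothetical sketch shows how a single universal perturbation might be optimized over a batch of diverse carrier images and then added to an unseen image at inference time. The function names, `target_nll` stand-in, and hyperparameters are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of a universal hijacking perturbation (not the paper's exact code).
import torch

def universal_perturbation(images, target_nll, steps=1000, eps=32 / 255, alpha=1 / 255):
    """Optimize one shared perturbation over a set of diverse carrier images.

    images     : float tensor in [0, 1], shape (N, 3, H, W)
    target_nll : callable(batch) -> scalar NLL of the attacker-preferred responses
    """
    delta = torch.zeros_like(images[0], requires_grad=True)
    for _ in range(steps):
        perturbed = (images + delta).clamp(0, 1)   # broadcast the shared delta over the batch
        loss = target_nll(perturbed)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return delta.detach()


if __name__ == "__main__":
    def dummy_nll(batch):                              # stand-in for the MLLM loss
        return ((batch - 0.5) ** 2).mean()

    delta = universal_perturbation(torch.rand(8, 3, 224, 224), dummy_nll, steps=10)
    carrier = torch.rand(3, 224, 224)                  # an image never seen during optimization
    hijacked = (carrier + delta).clamp(0, 1)           # embed the perturbation at inference time
    print(hijacked.shape)  # torch.Size([3, 224, 224])
```

Sharing one `delta` across many carrier images is what would make the perturbation transferable: the optimization cannot rely on the content of any single image, so the same component can be embedded into new images to steer responses toward the attacker-specified preference.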
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering (2025)
- Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models (2025)
- Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (2025)
- Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models (2025)
- MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? (2025)
- Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models (2025)
- PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance (2025)