Abstract
Adding a prompt- and noise-seed-specific personalization step with self- and cross-attention loss terms improves visual concept learning in text-to-image models.
Visual concept learning, also known as text-to-image personalization, is the process of teaching new concepts to a pretrained model. It has numerous applications, from product placement to entertainment and personalized design. Here we show that many existing methods can be substantially augmented by adding a personalization step that is (1) specific to the prompt and noise seed, and (2) guided by two loss terms based on the self- and cross-attention, capturing the identity of the personalized concept. Specifically, we leverage PDM features (previously designed to capture identity) and show how they can be used to improve personalized semantic similarity. We evaluate the benefit our method provides on top of six different personalization methods and several base text-to-image models (both UNet- and DiT-based), finding significant improvements even over previous per-query personalization methods.
Community
Our work adds a single personalization step on top of pretrained text-to-image personalization checkpoints that is (1) specific to the prompt and noise seed, and (2) guided by two loss terms based on the self- and cross-attention, capturing the identity of the personalized concept.
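To make the mechanics concrete, here is a minimal PyTorch-style sketch of what such a per-query step could look like. Everything in it is an illustrative assumption rather than the paper's implementation: the function name `per_query_step`, the interface in which the denoiser returns its attention features alongside the noise prediction, the use of plain MSE for both attention losses, and the loss weights.

```python
import torch
import torch.nn.functional as F

def per_query_step(denoiser, noisy_latents, timesteps, text_embeddings,
                   ref_self_feats, ref_cross_maps,
                   lambda_self=1.0, lambda_cross=1.0, lr=1e-5):
    """One gradient update tying a personalized checkpoint to a single
    (prompt, noise seed) query.

    Assumes (illustratively) that `denoiser` (UNet or DiT) returns its
    self-attention features and cross-attention maps alongside the noise
    prediction, and that `noisy_latents` were produced with the same fixed
    noise seed that will be used at generation time, which is what makes
    the step query-specific.
    """
    opt = torch.optim.AdamW(denoiser.parameters(), lr=lr)

    _, self_feats, cross_maps = denoiser(noisy_latents, timesteps,
                                         text_embeddings)

    # Self-attention loss: match PDM-style identity features extracted from
    # the single concept image, preserving the subject's identity.
    loss_self = F.mse_loss(self_feats, ref_self_feats)

    # Cross-attention loss: keep the concept token's attention maps close to
    # reference maps, so the prompt still binds to the right region.
    loss_cross = F.mse_loss(cross_maps, ref_cross_maps)

    loss = lambda_self * loss_self + lambda_cross * loss_cross
    loss.backward()
    opt.step()
    opt.zero_grad()
    return float(loss.detach())
```

Because the update sees the exact prompt embeddings and noise inputs that will be used at generation time, it specializes the checkpoint to that one query rather than to the concept in general.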
With just a single gradient update (~4 seconds on an NVIDIA H100 GPU) and a single image of the target concept, our method:
(1) Achieves new SoTA results by enhancing per-query generation of personalization checkpoints, with an average gain of +8% in image alignment and +23% in text alignment. For example, it improves LoRA by +7% / +14%.
(2) Outperforms previous per-query methods by an average of +12% in image alignment and +13% in text alignment.
It is compatible with a wide range of personalization techniques (e.g., DreamBooth, LoRA, Textual Inversion) and supports various diffusion backbones, including UNet-based models (e.g., SDXL, SD) and transformer-based models (e.g., FLUX, SD3).
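As a compatibility illustration, the hedged sketch below loads an SDXL pipeline with diffusers, attaches a LoRA personalization checkpoint, and fixes the query's prompt and noise seed around the single update. The `from_pretrained`, `load_lora_weights`, and `generator` calls are real diffusers/PyTorch APIs; the LoRA path and the call to the hypothetical `per_query_step` above are placeholders, not the authors' released code.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/personalized-lora")  # hypothetical checkpoint

prompt = "a photo of sks dog at the beach"           # the query's fixed prompt
generator = torch.Generator("cuda").manual_seed(42)  # the query's fixed seed

# ... run per_query_step(...) once on pipe.unet with this prompt and seed ...

image = pipe(prompt, generator=generator).images[0]
image.save("personalized.png")
```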
Paper: https://arxiv.org/abs/2508.09045
Project page: https://per-query-visual-concept-learning.github.io/