Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation
Abstract
A novel method disentangles visual and semantic features from diffusion model backbones to quantify and localize visual inconsistencies in subject-driven image generation.
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision-language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project Page: https://abdo-eldesokey.github.io/mind-the-glitch/
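The page does not include reference code, so the following is a minimal, hypothetical sketch of what a VSM-style score could look like: given disentangled visual feature maps for a reference and a generated image, plus semantic correspondences between them, compare the visual features at corresponding locations so that low similarity flags a visual glitch. All names, shapes, and the aggregation choice here are illustrative assumptions, not the authors' actual method or API.

```python
# Hypothetical sketch of a VSM-style consistency score (not the authors' code).
# Assumption: we already have disentangled *visual* feature maps (C x H x W)
# for the reference and generated images, plus a list of semantically
# corresponding pixel locations between them from the correspondence stage.
import torch
import torch.nn.functional as F

def vsm_score(visual_ref, visual_gen, matches):
    """visual_ref, visual_gen: (C, H, W) visual feature maps.
    matches: list of ((y_r, x_r), (y_g, x_g)) semantic correspondences.
    Returns a scalar score and per-match similarities for localization."""
    sims = []
    for (yr, xr), (yg, xg) in matches:
        f_ref = visual_ref[:, yr, xr]  # visual feature at the reference location
        f_gen = visual_gen[:, yg, xg]  # visual feature at the matched location
        sims.append(F.cosine_similarity(f_ref, f_gen, dim=0))
    sims = torch.stack(sims)
    # Low similarity at a semantically matched location suggests the subject's
    # appearance changed there, i.e. a candidate visual inconsistency.
    return sims.mean().item(), sims

# Toy usage with random features and a few dummy correspondences.
C, H, W = 256, 32, 32
ref, gen = torch.randn(C, H, W), torch.randn(C, H, W)
matches = [((4, 5), (6, 5)), ((10, 12), (11, 12)), ((20, 8), (19, 9))]
score, per_match = vsm_score(ref, gen, matches)
print(f"VSM-style score: {score:.3f}")
```

Because the per-match similarities are kept alongside the mean, the same sketch supports both uses described in the abstract: the scalar quantifies overall consistency, while low-similarity matches can be mapped back to image coordinates to localize inconsistent regions.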
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement (2025)
- MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing (2025)
- SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet (2025)
- FocusDPO: Dynamic Preference Optimization for Multi-Subject Personalized Image Generation via Adaptive Focus (2025)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception (2025)
- RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution (2025)
- Stencil: Subject-Driven Generation with Context Guidance (2025)