Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Abstract
Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to the difficulty of creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps, so it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset, obtaining over 120K refined samples, which we then use to fine-tune the InstructPix2Pix model, guided by our novel Instruct-CLIP-based loss function. The resulting model produces edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
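The abstract only sketches the objective, so the following is a minimal PyTorch sketch of the kind of contrastive alignment it describes: a CLIP-style InfoNCE loss between image-change directions and instruction embeddings. The difference-of-embeddings formulation, the function name, and the temperature value are illustrative assumptions, not the authors' actual implementation (see the linked repository for that).

```python
# Hypothetical sketch: align the *change* between an original and an edited
# image with the embedding of the edit instruction, CLIP/InfoNCE style.
# All names and the temperature value here are illustrative assumptions.
import torch
import torch.nn.functional as F

def instruct_clip_contrastive_loss(img_orig_emb, img_edit_emb, text_emb,
                                   temperature=0.07):
    """Contrast image-change embeddings against instruction embeddings.

    img_orig_emb, img_edit_emb: (B, D) image embeddings of the original
        and edited images (e.g., from a CLIP image encoder).
    text_emb: (B, D) text embeddings of the edit instructions.
    """
    # Represent the edit as the direction from the original to the edited image.
    change = F.normalize(img_edit_emb - img_orig_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)

    # Symmetric InfoNCE over the batch: matching (change, instruction)
    # pairs are positives; all other pairings in the batch are negatives.
    logits = change @ text.t() / temperature              # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Example usage with random tensors standing in for encoder outputs:
B, D = 8, 512
loss = instruct_clip_contrastive_loss(
    torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

A score of this form could plausibly serve both roles the abstract mentions: ranking how well a dataset instruction matches an image pair during data refinement, and, once the encoders are adapted to noisy latents and diffusion timesteps, acting as an auxiliary alignment term in the LDM fine-tuning loss.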
Community
The following similar papers were recommended by the Semantic Scholar API:
- LUSD: Localized Update Score Distillation for Text-Guided Image Editing (2025)
- PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data (2025)
- MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing (2025)
- Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias (2025)
- DualNeRF: Text-Driven 3D Scene Editing via Dual-Field Representation (2025)
- Edit Transfer: Learning Image Editing via Vision In-Context Relations (2025)
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing (2025)