arxiv:2504.20438

PixelHacker: Image Inpainting with Structural and Semantic Consistency

Published on Apr 29 · Submitted by Uyoung on May 5
#1 Paper of the day

Abstract

Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and implausible generations. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset of 14 million image-mask pairs by annotating foreground and background regions (116 and 21 potential categories, respectively). Then, we encode the potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA across a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page: https://hustvl.github.io/PixelHacker.
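A minimal PyTorch sketch of what the conditioning described in the abstract could look like, assuming two fixed-size embedding tables (116 foreground and 21 background categories) injected via linear attention with the elu(x) + 1 feature map of Katharopoulos et al. (2020). All class names, dimensions, and the injection point are illustrative assumptions, not the authors' implementation:

```python
# Sketch of latent-category-guided conditioning via linear attention.
# NOT the authors' code: names, dims, and wiring are assumptions based
# only on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCategoryEmbeddings(nn.Module):
    """Two fixed-size embedding tables for potential foreground and
    background categories, concatenated into condition tokens."""
    def __init__(self, dim: int, n_fg: int = 116, n_bg: int = 21):
        super().__init__()
        self.fg = nn.Embedding(n_fg, dim)
        self.bg = nn.Embedding(n_bg, dim)

    def forward(self, fg_ids: torch.Tensor, bg_ids: torch.Tensor) -> torch.Tensor:
        # fg_ids: (B, Tf), bg_ids: (B, Tb) -> (B, Tf + Tb, dim)
        return torch.cat([self.fg(fg_ids), self.bg(bg_ids)], dim=1)

class LinearCrossAttention(nn.Module):
    """Kernelized cross-attention: O(N) in the number of denoiser tokens,
    using the elu(x) + 1 feature map instead of softmax."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        q = F.elu(self.to_q(x)) + 1      # (B, N, D) denoiser hidden states
        k = F.elu(self.to_k(cond)) + 1   # (B, M, D) category tokens
        v = self.to_v(cond)              # (B, M, D)
        kv = torch.einsum("bmd,bme->bde", k, v)                    # (B, D, D)
        norm = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6  # (B, N)
        return torch.einsum("bnd,bde->bne", q, kv) / norm.unsqueeze(-1)

# Toy usage: inject category features into 64x64 latent tokens residually.
emb = LatentCategoryEmbeddings(dim=320)
attn = LinearCrossAttention(dim=320)
x = torch.randn(2, 64 * 64, 320)
cond = emb(torch.randint(0, 116, (2, 8)), torch.randint(0, 21, (2, 4)))
out = x + attn(x, cond)
```

The linear form avoids materializing an (N × M) attention map, which is what makes intermittent injection into every denoising step cheap; this motivation is inferred from the abstract's mention of linear attention, not stated by the authors.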

Community

Paper author · Paper submitter

🌟Highlights

  • Latent Categories Guidance (LCG): Simple yet effective inpainting paradigm with superior structural and semantic consistency. Let's advance inpainting research to challenge more complex scenarios!
  • PixelHacker: Diffusion-based inpainting model trained with LCG, outperforming SOTA methods across multiple natural-scene (Places2) and human-face (CelebA-HQ and FFHQ) benchmarks!
  • Comprehensive SOTA Performance (a generic metric-computation sketch follows this list)
    • Places2 (Natural Scene)
      • Evaluated at 512 resolution using 10k test set images with 40-50% masked regions, PixelHacker achieved the best performance with FID 8.59 and LPIPS 0.2026.
      • Evaluated at 512 resolution using 36.5k validation set images with large and small mask settings, PixelHacker achieved the best performance on FID (large: 2.05, small: 0.82) and U-IDS (large: 36.07, small: 42.21), and the second-best performance on LPIPS (large: 0.169, small: 0.088).
      • Evaluated at 256 and 512 resolutions using validation set images with a highly randomized masking strategy, PixelHacker achieved the best performance at 512 resolution with FID 5.75 and LPIPS 0.305, and the second-best performance at 256 resolution with FID 9.25 and LPIPS 0.367.
    • CelebA-HQ (Human-Face Scene)
      • Evaluated at 512 resolution, PixelHacker achieved the best performance with FID 4.75 and LPIPS 0.115.
    • FFHQ (Human-Face Scene)
      • Evaluated at 256 resolution, PixelHacker achieved the best performance with FID 6.35 and LPIPS 0.229.
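For context, here is a generic sketch of how FID and LPIPS scores like those above are typically computed, using torchmetrics. The data loader, image shapes, and device are placeholders; this is not the paper's exact evaluation pipeline:

```python
# Generic FID / LPIPS evaluation sketch (torchmetrics). The `pairs`
# iterable is a placeholder, not the paper's benchmark protocol.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

@torch.no_grad()
def evaluate(pairs, device: str = "cuda"):
    """`pairs` yields (real, fake) uint8 tensors of shape (B, 3, 512, 512)."""
    fid = FrechetInceptionDistance(feature=2048).to(device)
    # normalize=True -> LPIPS expects floats in [0, 1]
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True).to(device)
    lpips_scores = []
    for real, fake in pairs:
        real, fake = real.to(device), fake.to(device)
        fid.update(real, real=True)    # default FID input: uint8 in [0, 255]
        fid.update(fake, real=False)
        lpips_scores.append(lpips(fake.float() / 255, real.float() / 255).item())
    return fid.compute().item(), sum(lpips_scores) / len(lpips_scores)
```

FID is accumulated over the whole set and computed once at the end (it compares feature distributions, not individual images), while LPIPS is a per-pair distance averaged here across batches.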