PixelHacker: Image Inpainting with Structural and Semantic Consistency
Abstract
Image inpainting is a fundamental research area at the intersection of image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and implausible generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (116 potential foreground and 21 potential background categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/PixelHacker.
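The abstract outlines the core mechanism: two fixed-size embeddings for the potential foreground and background categories, whose features are injected into the denoising features via linear attention. Below is a minimal PyTorch sketch of that idea; every name and detail here (`LCGAttention`, the elu+1 feature map, the residual injection) is an illustrative assumption, not the released implementation.

```python
# Hypothetical sketch of latent categories guidance (LCG) injection.
# All names and sizes (LCGAttention, num_fg=116, num_bg=21, the elu+1
# feature map) are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LCGAttention(nn.Module):
    """Injects latent foreground/background category features into
    denoising features via linear attention (linear in token count)."""
    def __init__(self, dim, num_fg=116, num_bg=21):
        super().__init__()
        # Two fixed-size embeddings for potential fg/bg categories.
        self.fg_embed = nn.Embedding(num_fg, dim)
        self.bg_embed = nn.Embedding(num_bg, dim)
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, dim) latent features from the denoising network.
        B = x.shape[0]
        ctx = torch.cat([self.fg_embed.weight, self.bg_embed.weight], 0)
        ctx = ctx.unsqueeze(0).expand(B, -1, -1)      # (B, 137, dim)
        # elu(.)+1 keeps features positive, enabling the kernelized
        # (softmax-free) attention that runs in O(N) rather than O(N^2).
        q = F.elu(self.to_q(x)) + 1
        k = F.elu(self.to_k(ctx)) + 1
        v = self.to_v(ctx)
        kv = torch.einsum("bmd,bme->bde", k, v)       # (B, dim, dim)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        return x + self.proj(out)                     # residual injection

x = torch.randn(2, 64 * 64, 320)                      # e.g. 64x64 latent tokens
print(LCGAttention(320)(x).shape)                     # torch.Size([2, 4096, 320])
```

The linear-attention form keeps the guidance injection cheap even for long token sequences; note that the paper injects these features only intermittently during denoising, which this sketch omits.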
Community
🌟Highlights
- Latent Categories Guidance (LCG): A simple yet effective inpainting paradigm with superior structural and semantic consistency. Let's push inpainting research toward more complex scenarios!
- PixelHacker: A diffusion-based inpainting model trained with LCG, surpassing SOTA performance across multiple natural-scene (Places2) and human-face (CelebA-HQ and FFHQ) benchmarks!
- Comprehensive SOTA Performance (a short metric-computation sketch follows this list):
- Places2 (Natural Scene)
- Evaluated at 512 resolution using 10k test set images with 40-50% masked regions, PixelHacker achieved the best performance with FID 8.59 and LPIPS 0.2026.
- Evaluated at 512 resolution using 36.5k validation set images with large and small mask settings, PixelHacker achieved the best performance on FID (large: 2.05, small: 0.82) and U-IDS (large: 36.07, small: 42.21), and the second-best performance on LPIPS (large: 0.169, small: 0.088).
- Evaluated at 256 and 512 resolutions using validation set images with a highly randomized masking strategy, PixelHacker achieved the best performance at 512 resolution with FID 5.75 and LPIPS 0.305, and the second-best performance at 256 resolution with FID 9.25 and LPIPS 0.367.
- CelebA-HQ (Human-Face Scene)
- Evaluated at 512 resolution, PixelHacker achieved the best performance with FID 4.75 and LPIPS 0.115.
- FFHQ (Human-Face Scene)
- Evaluated at 256 resolution, PixelHacker achieved the best performance with FID 6.35 and LPIPS 0.229.
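The FID and LPIPS numbers above are standard image-quality metrics; the sketch below shows how they are typically computed with the open-source `lpips` and `torchmetrics` packages. It illustrates only the metric calls, not the paper's exact evaluation script (mask sampling, image counts, and resolutions follow the benchmark settings listed above).

```python
# Minimal FID/LPIPS computation with standard libraries; this is an
# illustrative sketch, not the paper's evaluation script.
import torch
import lpips                                          # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance

lpips_fn = lpips.LPIPS(net="alex")                    # perceptual distance
fid = FrechetInceptionDistance(feature=2048)

def update_metrics(real, fake):
    """real/fake: float tensors in [0, 1] with shape (B, 3, H, W)."""
    fid.update((real * 255).to(torch.uint8), real=True)   # FID expects uint8
    fid.update((fake * 255).to(torch.uint8), real=False)
    return lpips_fn(real * 2 - 1, fake * 2 - 1).mean()    # LPIPS expects [-1, 1]

real = torch.rand(8, 3, 512, 512)                     # stand-ins for ground truth
fake = torch.rand(8, 3, 512, 512)                     # stand-ins for inpainted outputs
print("LPIPS:", update_metrics(real, fake).item())
# In practice, accumulate thousands of image pairs before computing FID.
print("FID:", fid.compute().item())
```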
Project Page: https://hustvl.github.io/PixelHacker
Github: https://github.com/hustvl/PixelHacker
Paper: https://arxiv.org/abs/2504.20438