---
license: cc-by-nc-4.0
base_model:
  - stabilityai/stable-diffusion-3-medium-diffusers
pipeline_tag: image-to-image
tags:
  - image-generation
  - image-to-image
  - virtual-try-on
  - virtual-try-off
  - diffusion
  - dit
  - stable-diffusion-3
  - multimodal
  - fashion
  - pytorch
language: en
datasets:
  - dresscode
  - viton-hd
---

# TEMU-VTOFF

**Text-Enhanced MUlti-category Virtual Try-Off**

*(Teaser figure: TEMU-VTOFF)*

**Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals**

Davide Lobba<sup>1,2,*</sup>, Fulvio Sanguigni<sup>2,3,*</sup>, Bin Ren<sup>1,2</sup>, Marcella Cornia<sup>3</sup>, Rita Cucchiara<sup>3</sup>, Nicu Sebe<sup>1</sup>

<sup>1</sup>University of Trento, <sup>2</sup>University of Pisa, <sup>3</sup>University of Modena and Reggio Emilia

<sup>*</sup> Equal contribution

## 💡 Model Description

TEMU-VTOFF is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating in-shop, product-style images of a garment starting from a photo of the person wearing it. By combining a pretrained feature extractor with a text-enhanced generation module, our method can handle occlusions, multiple garment categories, and ambiguous garment appearances. A feature alignment module based on DINOv2 further refines generation fidelity.

This model builds on stabilityai/stable-diffusion-3-medium-diffusers. The uploaded weights correspond to the fine-tuned feature extractor and the VTOFF DiT module.
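This card does not ship an official inference script, so the snippet below is only a hedged starting point: it loads the Stable Diffusion 3 base pipeline with the `diffusers` library. The function name and `device` argument are illustrative; loading the fine-tuned feature extractor and VTOFF DiT weights from this repository requires the project's own code.

```python
# Sketch only: loads the SD3-medium base that TEMU-VTOFF fine-tunes.
# The fine-tuned feature extractor / VTOFF DiT weights from this repo
# need the project's own code and are NOT loaded here.
def load_base_pipeline(device: str = "cpu"):
    # Local imports so the sketch stays readable without the heavy deps.
    import torch
    from diffusers import StableDiffusion3Pipeline  # SD3 support: diffusers >= 0.29

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
    )
    return pipe.to(device)
```

Note that the base checkpoint is license-gated on the Hub, so you must accept its terms and be authenticated before `from_pretrained` can download it.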

## ✨ Key Features

Our contribution can be summarized as follows:

- 🎯 **Multi-Category Try-Off.** We present a unified framework that handles multiple garment types (upper-body, lower-body, and full-body clothes) without requiring category-specific pipelines.
- 🔗 **Multimodal Hybrid Attention.** We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features, helping the model synthesize occluded or ambiguous garment regions more accurately.
- ⚡ **Garment Aligner Module.** We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This improves alignment consistency across the dataset and preserves fine-grained visual details.
- 📊 **Extensive Experiments.** Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both image quality and alignment with the target garment, highlighting its strong generalization capabilities.
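To make the hybrid-attention idea above concrete, here is a minimal NumPy sketch, not the paper's implementation: garment queries attend over the concatenation of text-description tokens and person-image features, so textual cues can inform occluded regions. All shapes, names, and the single-head formulation are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(garment_q, text_kv, person_kv):
    """Single-head sketch: garment_q (Nq, d) attends over the
    concatenation of text_kv (Nt, d) and person_kv (Np, d)."""
    d = garment_q.shape[-1]
    kv = np.concatenate([text_kv, person_kv], axis=0)   # (Nt+Np, d)
    attn = softmax(garment_q @ kv.T / np.sqrt(d))       # (Nq, Nt+Np)
    return attn @ kv                                    # (Nq, d)

rng = np.random.default_rng(0)
out = hybrid_attention(rng.normal(size=(16, 64)),   # garment queries
                       rng.normal(size=(8, 64)),    # caption tokens
                       rng.normal(size=(32, 64)))   # person features
print(out.shape)  # (16, 64)
```

In the actual model this mixing happens inside the DiT's transformer blocks with learned projections; the sketch only shows how a shared key/value pool lets text and person features jointly steer each garment query.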