Fashion-RAG
Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

International Joint Conference on Neural Networks (IJCNN) 2025
Oral Presentation
Fulvio Sanguigni1,2,*, Davide Morelli1,2,*, Marcella Cornia1, Rita Cucchiara1
1University of Modena and Reggio Emilia, 2University of Pisa
About The Project
Fashion-RAG is a novel approach in the fashion domain that handles multimodal virtual dressing through a new Retrieval-Augmented Generation (RAG) pipeline for visual data. Our approach retrieves garments aligned with a given textual description and uses the retrieved information as conditioning to generate the dressed person, with Stable Diffusion (SD) as the generative model. We finetune the SD U-Net and an additional adapter module (Inversion Adapter) to handle the retrieved information.
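For intuition, below is a minimal sketch of the retrieval step, assuming candidate garments are ranked by CLIP text-image similarity with open_clip (the same model family exposed via --clip_retrieve_model and --clip_retrieve_weights). The function and variable names are illustrative and do not reflect the repository's actual API.

import torch
import open_clip
from PIL import Image

# Illustrative sketch: rank candidate garment images by CLIP similarity
# to the textual description, then keep the top-k as conditioning inputs.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

def retrieve_garments(description, garment_paths, k=3):
    k = min(k, len(garment_paths))
    with torch.no_grad():
        text = model.encode_text(tokenizer([description]))
        text = text / text.norm(dim=-1, keepdim=True)
        images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in garment_paths])
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = (feats @ text.T).squeeze(1)
    top_indices = scores.topk(k).indices.tolist()
    return [garment_paths[i] for i in top_indices]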
Key Features
Our contribution can be summarized as follows:
- Retrieval-Enhanced Generation for Visual Items. We present a unified framework capable of performing virtual dressing without the need for a user-provided garment image. Instead, our method successfully leverages textual information and retrieves coherent garments to perform the task.
- Multiple-Garment Conditioning. We introduce a plug-and-play adapter module that is flexible with respect to the number of retrieved items, allowing up to 3 garments to be retrieved per text prompt (a conceptual sketch follows this list).
- Extensive Experiments. Experiments on the Dress Code dataset demonstrate that Fashion-RAG outperforms previous competitors.
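To illustrate how a variable number of retrieved garments can condition generation, the toy sketch below projects each retrieved garment's CLIP image features into a fixed number of pseudo-text tokens and concatenates them with the prompt embeddings. Module names and dimensions here are hypothetical and do not mirror the actual Inversion Adapter implementation.

import torch
import torch.nn as nn

class ToyInversionAdapter(nn.Module):
    # Toy adapter: maps each retrieved garment's CLIP embedding to a fixed
    # number of pseudo-text tokens in the text-encoder embedding space.
    def __init__(self, clip_dim=768, text_dim=1024, tokens_per_garment=4):
        super().__init__()
        self.tokens_per_garment = tokens_per_garment
        self.proj = nn.Linear(clip_dim, text_dim * tokens_per_garment)

    def forward(self, garment_feats):
        # garment_feats: (num_retrieved, clip_dim); num_retrieved may be 1, 2, or 3
        tokens = self.proj(garment_feats)
        return tokens.view(garment_feats.size(0) * self.tokens_per_garment, -1)

adapter = ToyInversionAdapter()
prompt_embeds = torch.randn(1, 77, 1024)   # placeholder text-encoder output
garment_feats = torch.randn(3, 768)        # CLIP features of 3 retrieved garments
garment_tokens = adapter(garment_feats).unsqueeze(0)
conditioning = torch.cat([prompt_embeds, garment_tokens], dim=1)  # fed to the U-Net cross-attention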
Getting Started
Prerequisites
Clone the repository:
git clone Fashion-RAG.git
Installation
- We recommend installing the required packages using Python's native virtual environment (venv) as follows:
python -m venv fashion-rag
source fashion-rag/bin/activate
- Upgrade pip and install dependencies
pip install --upgrade pip
pip install -r requirements.txt
- Create a .env file like the following:
export WANDB_API_KEY="ENTER YOUR WANDB TOKEN"
export TORCH_HOME="ENTER YOUR TORCH PATH TO SAVE TORCH MODELS USED FOR METRICS COMPUTING"
export HF_TOKEN="ENTER YOUR HUGGINGFACE TOKEN"
export HF_CACHE_DIR="PATH WHERE YOU WANT TO SAVE THE HF MODELS (NEEDED AS A CUSTOM VARIABLE TO ACCOUNT FOR OLD TRANSFORMERS PACKAGES; OTHERWISE USE HF_HOME)"
Data and Models
Download DressCode from the original repository. Download the finetuned U-Net and Inversion Adapter from this source and put them into your experiment folder as follows:
<experiment folder>/
├── unet_120000.pth
└── inversion_adapter_120000.pth
Copy the provided retrieval file-paths folder dataset/dresscode-retrieval into your retrieval path, or use it directly.
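As an optional sanity check before running inference, the downloaded checkpoints can be inspected with the short sketch below; it assumes they are standard PyTorch checkpoint files, which the repository documentation does not explicitly state.

import torch

experiment_folder = "path/to/experiment"  # placeholder: your experiment folder
for name in ["unet_120000.pth", "inversion_adapter_120000.pth"]:
    # Load on CPU just to confirm the file is readable and peek at its top-level keys.
    ckpt = torch.load(f"{experiment_folder}/{name}", map_location="cpu")
    summary = list(ckpt.keys())[:5] if isinstance(ckpt, dict) else type(ckpt).__name__
    print(name, summary)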
Inference
Let's generate virtual dressing images using the finetuned Fashion-RAG model.
source fashion-rag/bin/activate
python evaluate_RAG.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-2-inpainting \
--output_dir "output directory path" \
--finetuned_models_dir "U-Net and inversion adapter directory weights path" \
--unet_name unet_120000.pth --inversion_adapter_name inversion_adapter_120000.pth \
--dataset dresscode --dresscode_dataroot <data path>/DressCode \
--category "garment category" \
--test_order "specify paired or unpaired" --mask_type mask \
--phase test --num_inference_steps 50 \
--test_batch_size 8 --num_workers_test 8 --metrics_batch_size 8 --mixed_precision fp16 \
--text_usage prompt_noun_chunks \
--retrieve_path "dataset/dresscode-retrieval or your custom path" \
--clip_retrieve_model ViT-L-14 --clip_retrieve_weights laion2b_s32b_b82k \
--n_chunks "number of text chunks 1 or 3" \
--n_retrieved "number of retrieved images 1 to 3" \
--metrics fid_score kid_score retrieved_score clip_score lpips_score ssim_score \
--attention_layers_fine_list '-1' '0 1 2 3' \
--compute_metrics
The final output folder structure will look like this:
out_dir/pte_paired_nc_<number_of_chunks>_nr_<number_of_retrieved_images>/
├── lower_body/
├── upper_body/
├── dresses/
└── metrics_all.json
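To consume the results programmatically, here is a minimal sketch for loading the metrics file. It assumes metrics_all.json is a flat JSON object mapping metric names to values (the actual layout may differ), and uses example values nc_1 and nr_3 for the chunk and retrieval counts.

import json
from pathlib import Path

out_dir = Path("output directory path")  # same value passed to --output_dir
metrics_file = out_dir / "pte_paired_nc_1_nr_3" / "metrics_all.json"  # example subfolder name

with open(metrics_file) as f:
    metrics = json.load(f)

for name, value in metrics.items():
    print(f"{name}: {value}")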