Fashion-RAG
Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

International Joint Conference on Neural Networks (IJCNN) 2025
Oral Presentation
Fulvio Sanguigni1,2,*, Davide Morelli1,2,*, Marcella Cornia1, Rita Cucchiara1
1University of Modena and Reggio Emilia, 2University of Pisa
About The Project
Fashion-RAG is a novel approach in the fashion domain that handles multimodal virtual dressing through a new Retrieval-Augmented Generation (RAG) pipeline for visual data. Our approach retrieves garments aligned with a given textual description and uses the retrieved information as conditioning to generate the dressed person, with Stable Diffusion (SD) as the generative model. We finetune the SD U-Net and an additional adapter module (Inversion Adapter) to handle the retrieved information.
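For intuition, below is a minimal sketch of the retrieval step, assuming candidate garments are ranked by CLIP text-image similarity with open_clip (the same model family exposed via --clip_retrieve_model and --clip_retrieve_weights). The function and variable names are illustrative and do not reflect the repository's actual API.

import torch
import open_clip
from PIL import Image

# Illustrative sketch: rank candidate garment images by CLIP similarity
# to the textual description, then keep the top-k as conditioning inputs.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

def retrieve_garments(description, garment_paths, k=3):
    k = min(k, len(garment_paths))
    with torch.no_grad():
        text = model.encode_text(tokenizer([description]))
        text = text / text.norm(dim=-1, keepdim=True)
        images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in garment_paths])
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = (feats @ text.T).squeeze(1)
    top_indices = scores.topk(k).indices.tolist()
    return [garment_paths[i] for i in top_indices]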
Key Features
Our contribution can be summarized as follows:
- Retrieval-Enhanced Generation for Visual Items. We present a unified framework capable of performing virtual dressing without the need for a user-provided garment image. Instead, our method successfully leverages textual information and retrieves coherent garments to perform the task.
- Multiple-Garment Conditioning. We introduce a plug-and-play adapter module that is flexible with respect to the number of retrieved items, allowing up to 3 garments to be retrieved per text prompt (a conceptual sketch follows this list).
- Extensive Experiments. Experiments on the Dress Code dataset demonstrate that Fashion-RAG outperforms previous competitors.
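To illustrate how a variable number of retrieved garments can condition generation, the toy sketch below projects each retrieved garment's CLIP image features into a fixed number of pseudo-text tokens and concatenates them with the prompt embeddings. Module names and dimensions here are hypothetical and do not mirror the actual Inversion Adapter implementation.

import torch
import torch.nn as nn

class ToyInversionAdapter(nn.Module):
    # Toy adapter: maps each retrieved garment's CLIP embedding to a fixed
    # number of pseudo-text tokens in the text-encoder embedding space.
    def __init__(self, clip_dim=768, text_dim=1024, tokens_per_garment=4):
        super().__init__()
        self.tokens_per_garment = tokens_per_garment
        self.proj = nn.Linear(clip_dim, text_dim * tokens_per_garment)

    def forward(self, garment_feats):
        # garment_feats: (num_retrieved, clip_dim); num_retrieved may be 1, 2, or 3
        tokens = self.proj(garment_feats)
        return tokens.view(garment_feats.size(0) * self.tokens_per_garment, -1)

adapter = ToyInversionAdapter()
prompt_embeds = torch.randn(1, 77, 1024)   # placeholder text-encoder output
garment_feats = torch.randn(3, 768)        # CLIP features of 3 retrieved garments
garment_tokens = adapter(garment_feats).unsqueeze(0)
conditioning = torch.cat([prompt_embeds, garment_tokens], dim=1)  # fed to the U-Net cross-attention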
Getting Started
Prerequisites
Clone the repository:
git clone Fashion-RAG.git
Installation
- We recommend installing the required packages using Python's native virtual environment (venv) as follows:
python -m venv fashion-rag
source fashion-rag/bin/activate
- Upgrade pip and install dependencies
pip install --upgrade pip
pip install -r requirements.txt
- Create a .env file like the following:
export WANDB_API_KEY="ENTER YOUR WANDB TOKEN"
export TORCH_HOME="ENTER YOUR TORCH PATH TO SAVE TORCH MODELS USED FOR METRICS COMPUTING"
export HF_TOKEN="ENTER YOUR HUGGINGFACE TOKEN"
export HF_CACHE_DIR="PATH WHERE YOU WANT TO SAVE THE HF MODELS (NEEDED AS A CUSTOM VARIABLE TO ACCOUNT FOR OLD TRANSFORMERS PACKAGES; OTHERWISE USE HF_HOME)"
Data and Models
Download DressCode from the original repository. Download the finetuned U-Net and Inversion Adapter from this source and put them into your experiment folder as follows:
<experiment folder>/
├── unet_120000.pth
└── inversion_adapter_120000.pth
Copy the provided retrieval file-paths folder dataset/dresscode-retrieval into your retrieval path, or use it directly.
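As an optional sanity check before running inference, the downloaded checkpoints can be inspected with the short sketch below; it assumes they are standard PyTorch checkpoint files, which the repository documentation does not explicitly state.

import torch

experiment_folder = "path/to/experiment"  # placeholder: your experiment folder
for name in ["unet_120000.pth", "inversion_adapter_120000.pth"]:
    # Load on CPU just to confirm the file is readable and peek at its top-level keys.
    ckpt = torch.load(f"{experiment_folder}/{name}", map_location="cpu")
    summary = list(ckpt.keys())[:5] if isinstance(ckpt, dict) else type(ckpt).__name__
    print(name, summary)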
Inference
Let's generate virtual dressing images using the finetuned Fashion-RAG model.
source fashion-rag/bin/activate
python evaluate_RAG.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-2-inpainting \
--output_dir "output directory path" \
--finetuned_models_dir "U-Net and inversion adapter directory weights path" \
--unet_name unet_120000.pth --inversion_adapter_name inversion_adapter_120000.pth \
--dataset dresscode --dresscode_dataroot <data path>/DressCode \
--category "garment category" \
--test_order "specify paired or unpaired" --mask_type mask \
--phase test --num_inference_steps 50 \
--test_batch_size 8 --num_workers_test 8 --metrics_batch_size 8 --mixed_precision fp16 \
--text_usage prompt_noun_chunks \
--retrieve_path "dataset/dresscode-retrieval or your custom path" \
--clip_retrieve_model ViT-L-14 --clip_retrieve_weights laion2b_s32b_b82k \
--n_chunks "number of text chunks 1 or 3" \
--n_retrieved "number of retrieved images 1 to 3" \
--metrics fid_score kid_score retrieved_score clip_score lpips_score ssim_score \
--attention_layers_fine_list '-1' '0 1 2 3' \
--compute_metrics
The final output folder structure will look like this:
out_dir/pte_paired_nc_<number_of_chunks>_nr_<number_of_retrieved_images>/
├── lower_body/
├── upper_body/
├── dresses/
└── metrics_all.json
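To consume the results programmatically, here is a minimal sketch for loading the metrics file. It assumes metrics_all.json is a flat JSON object mapping metric names to values (the actual layout may differ), and uses example values nc_1 and nr_3 for the chunk and retrieval counts.

import json
from pathlib import Path

out_dir = Path("output directory path")  # same value passed to --output_dir
metrics_file = out_dir / "pte_paired_nc_1_nr_3" / "metrics_all.json"  # example subfolder name

with open(metrics_file) as f:
    metrics = json.load(f)

for name, value in metrics.items():
    print(f"{name}: {value}")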