ROSE: Remove Objects with Side Effects in Videos
This repository contains the finetuned WanTransformer3D weights for ROSE, a model for removing objects with side effects in videos.
📚 Paper - 🌐 Project Page - 💻 Code - 🤗 Demo
Abstract
Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, short for Remove Objects with Side Effects, a framework that systematically studies an object's effects on its environment, which can be categorized into five common cases: shadow, reflection, light, translucency, and mirror. Given the challenge of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model's performance on removing various side effects, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios.
Dependencies and Installation
Clone Repo
```shell
git clone https://github.com/Kunbyte-AI/ROSE.git
```
Create Conda Environment and Install Dependencies
```shell
# create new anaconda env
conda create -n rose python=3.12 -y
conda activate rose

# install python dependencies
pip3 install -r requirements.txt
```
- CUDA = 12.4
- PyTorch = 2.6.0
- Torchvision = 0.21.0
- Other required packages listed in `requirements.txt`
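As an optional sanity check, you can confirm that the installed PyTorch build matches the versions above and sees CUDA:

```shell
# optional: verify the PyTorch / CUDA setup (expects 2.6.0, 12.4, True)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```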
Usage (Quick Test)
To get started, you need to prepare the pretrained models first.
Prepare pretrained models

We use the pretrained Wan2.1-Fun-1.3B-InP as our base model. During training, we only train the WanTransformer3D part and keep the other parts frozen. You can download the Transformer3D weights of ROSE from this link. For local inference, the `weights` directory should be arranged like this:

```
weights
└── transformer
    ├── config.json
    └── diffusion_pytorch_model.safetensors
```
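Alternatively, a minimal command-line sketch: assuming this model card's Hub repo id is `Kunbyte/ROSE` and that the repo mirrors the layout above, the weights can be fetched with `huggingface-cli`:

```shell
# sketch: download the finetuned Transformer3D weights from the Hub into ./weights
huggingface-cli download Kunbyte/ROSE --local-dir ./weights
```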
Also, it is necessary to prepare the base model in the `models` directory. You can download the Wan2.1-Fun-1.3B-InP base model from this link. The `models` directory will be arranged like this:

```
models
└── Wan2.1-Fun-1.3B-InP
    ├── google
    │   └── umt5-xxl
    │       ├── spiece.model
    │       ├── special_tokens_map.json
    │       └── ...
    ├── xlm-roberta-large
    │   ├── sentencepiece.bpe.model
    │   ├── tokenizer_config.json
    │   └── ...
    ├── config.json
    ├── configuration.json
    ├── diffusion_pytorch_model.safetensors
    ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
    ├── models_t5_umt5-xxl-enc-bf16.pth
    └── Wan2.1_VAE.pth
```
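If you prefer the Hub CLI here as well, a minimal sketch (assuming the base-model repo id `alibaba-pai/Wan2.1-Fun-1.3B-InP`, as listed in the acknowledgements):

```shell
# sketch: download the Wan2.1-Fun-1.3B-InP base model into ./models
huggingface-cli download alibaba-pai/Wan2.1-Fun-1.3B-InP --local-dir ./models/Wan2.1-Fun-1.3B-InP
```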
Run Inference

We provide some examples in the `data/eval` folder. Run the following command to try it out:

```shell
python inference.py \
  --validation_videos "path/to/your/video.mp4" \
  --validation_masks "path/to/your/mask.mp4" \
  --validation_prompts "" \
  --output_dir "./output" \
  --video_length 16 \
  --sample_size 480 720
```
For more options, refer to the usage information in the GitHub repository:
```
Usage: python inference.py [options]

Options:
  --validation_videos    Path(s) to input videos
  --validation_masks     Path(s) to mask videos
  --validation_prompts   Text prompts (default: [""])
  --output_dir           Output directory
  --video_length         Number of frames per video (it needs to be 16n+1)
  --sample_size          Frame size: height width (default: 480 720)
```
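Note that `--validation_masks` expects a mask video aligned frame-by-frame with the input clip. If your masks are per-frame binary images, one possible way to pack them into a video is with ffmpeg (the `masks/0001.png` naming below is an assumption for illustration, not part of this repo):

```shell
# sketch: pack per-frame mask images (masks/0001.png, masks/0002.png, ...) into a mask video
ffmpeg -framerate 16 -i masks/%04d.png -c:v libx264 -pix_fmt yuv420p mask.mp4
```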
An interactive demo is also available on Hugging Face Spaces.
Results
Qualitative results are presented as paired videos (Masked Input | Output) across six categories: Shadow, Reflection, Common, Light Source, Translucent, and Mirror. See the project page for the full video examples.
Overview
(Framework overview figure; see the project page.)
Citation
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@article{miao2025rose,
  title={ROSE: Remove Objects with Side Effects in Videos},
  author={Miao, Chenxuan and Feng, Yutong and Zeng, Jianshu and Gao, Zixiang and Liu, Hantang and Yan, Yunfeng and Qi, Donglian and Chen, Xi and Wang, Bin and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2508.18633},
  year={2025}
}
```
Acknowledgement
This code is based on Wan2.1-Fun-1.3B-Inpaint, and some code is borrowed from ProPainter. Thanks for their awesome work!