Update README.md
README.md
CHANGED
@@ -12,6 +12,7 @@ tags:
- video editing
---

+
# VideoPainter

This repository contains the implementation of the paper "VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control"

@@ -24,28 +25,21 @@ Keywords: Video Inpainting, Video Editing, Video Generation


<p align="center">
-
-<a href="https://arxiv.org/abs/2503.05639">Arxiv</a> |
-<a href="https://huggingface.co/collections/TencentARC/videopainter-67cc49c6146a48a2ba93d159">Data</a> |
-<a href="https://youtu.be/HYzNfsD3A0s">📹Video</a> |
-<a href="https://huggingface.co/TencentARC/VideoPainter">🤗Hugging Face Model</a> |
+<a href='https://yxbian23.github.io/project/video-painter'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href="https://arxiv.org/abs/2503.05639"><img src="https://img.shields.io/badge/arXiv-2503.05639-b31b1b.svg"></a> <a href="https://youtu.be/HYzNfsD3A0s"><img src="https://img.shields.io/badge/YouTube-Video-red?logo=youtube"></a> <a href="https://github.com/TencentARC/VideoPainter"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> <a href='https://huggingface.co/datasets/TencentARC/VPData'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a> <a href='https://huggingface.co/datasets/TencentARC/VPBench'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Benchmark-blue'></a> <a href="https://huggingface.co/TencentARC/VideoPainter"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue"></a>
</p>

+**Your likes and stars mean a lot to us as we develop this project!** ❤️
+

**📖 Table of Contents**


- [VideoPainter](#videopainter)
- [🔥 Update Log](#-update-log)
-  - [
+  - [TODO](#todo)
- [🛠️ Method Overview](#-method-overview)
- [🚀 Getting Started](#-getting-started)
-  - [Environment Requirement 🌍](#environment-requirement-)
-  - [Data Download ⬇️](#data-download-)
- [🏃🏼 Running Scripts](#-running-scripts)
-  - [Training 🤯](#training-)
-  - [Inference 📜](#inference-)
-  - [Evaluation 📏](#evaluation-)
- [🤝🏼 Cite Us](#-cite-us)
- [💖 Acknowledgement](#-acknowledgement)

@@ -66,13 +60,14 @@ Keywords: Video Inpainting, Video Editing, Video Generation
## 🛠️ Method Overview

We propose VideoPainter, a novel dual-stream paradigm that incorporates an efficient context encoder (comprising only 6\% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues into any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target-region ID resampling technique that enables any-length video inpainting, greatly enhancing practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench, the largest video inpainting dataset and benchmark to date with over 390K diverse clips, to facilitate segmentation-based inpainting training and assessment. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential.
-
+


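To make the dual-stream description above concrete, here is a minimal, illustrative sketch. It is not the repository's implementation: a frozen stand-in for one pre-trained DiT block plus a small trainable context encoder whose output is added to the backbone's input tokens, mimicking plug-and-play injection of masked-video context. All module names, shapes, and the injection point are hypothetical simplifications.

```python
# Illustrative sketch only, NOT the VideoPainter implementation: a frozen
# stand-in for a pre-trained DiT block plus a small trainable context encoder
# whose output is added to the backbone's input tokens (plug-and-play style).
import torch
import torch.nn as nn

class TinyBackboneBlock(nn.Module):
    """Stands in for one block of a pre-trained video DiT (kept frozen)."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class TinyContextEncoder(nn.Module):
    """Small trainable branch that turns masked-video tokens into context cues."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, masked_tokens):
        return self.proj(masked_tokens)

backbone = TinyBackboneBlock()
for p in backbone.parameters():      # backbone stays frozen; only the context branch would train
    p.requires_grad_(False)
context_encoder = TinyContextEncoder()

noisy_tokens = torch.randn(1, 16, 64)    # toy latent tokens being denoised
masked_context = torch.randn(1, 16, 64)  # toy tokens from the masked input video
out = backbone(noisy_tokens + context_encoder(masked_context))
print(out.shape)                         # torch.Size([1, 16, 64])
```
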
## 🚀 Getting Started

+<details>
+<summary><b>Environment Requirement 🌍</b></summary>


Clone the repo:
...
cd ./app
pip install -e .
```
+</details>

+<details>
+<summary><b>Data Download ⬇️</b></summary>


**VPBench and VPData**
...
python VPData_download.py
```

+</details>
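For reference, the dataset repositories linked in the badges above can also be fetched programmatically with huggingface_hub. This is a hedged sketch: the repo ids come from the badge links, the local paths are placeholders, and the download script above remains the canonical route.

```python
# Hypothetical alternative to the script above (local paths are placeholders).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="TencentARC/VPBench", repo_type="dataset", local_dir="./data/VPBench")
snapshot_download(repo_id="TencentARC/VPData", repo_type="dataset", local_dir="./data/VPData")
```
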

+<details>
+<summary><b>Checkpoints</b></summary>

Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The ckpt folder contains
@@ -239,12 +238,12 @@ The ckpt structure should be like:
|-- vae
|-- ...
```
-
+</details>
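If convenient, the released checkpoints can likewise be pulled in one call. A small hedged sketch using huggingface_hub: the repo id is the model link above, ./ckpt mirrors the layout shown, and other components referenced in the tree (for example the image inpainting model) may come from separate repos, so adjust as needed.

```python
# Hypothetical helper, not part of the repo: fetch the released checkpoints into ./ckpt.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="TencentARC/VideoPainter", local_dir="./ckpt")
```
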

## 🏃🏼 Running Scripts

-
-
+<details>
+<summary><b>Training 🤯</b></summary>

You can train the VideoPainter using the script:

@@ -387,11 +386,11 @@ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml
--p_random_brush 0.3 \
--id_pool_resample_learnable
```
+</details>


-
-
-### Inference 📜
+<details>
+<summary><b>Inference 📜</b></summary>

You can run inference for video inpainting or editing with the script:

@@ -411,7 +410,10 @@ bash edit_bench.sh
```

Since VideoPainter is trained on public Internet videos, it primarily performs well on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community!
+</details>

+<details>
+<summary><b>Gradio Demo</b></summary>

You can also run inference through the gradio demo:

@@ -423,9 +425,11 @@ CUDA_VISIBLE_DEVICES=0 python app.py \
--id_adapter ../ckpt/VideoPainterID/checkpoints \
--img_inpainting_model ../ckpt/flux_inp
```
+</details>


-
+<details>
+<summary><b>Evaluation 📏</b></summary>

You can evaluate using the script:

@@ -440,19 +444,16 @@ bash eval_edit.sh
# video editing with ID resampling
bash eval_editing_id_resample.sh
```
-
+</details>

## 🤝🏼 Cite Us

```
-@
-
-
-
-
-      archivePrefix={arXiv},
-      primaryClass={cs.CV},
-      url={https://arxiv.org/abs/2503.05639},
+@article{bian2025videopainter,
+  title={VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control},
+  author={Bian, Yuxuan and Zhang, Zhaoyang and Ju, Xuan and Cao, Mingdeng and Xie, Liangbin and Shan, Ying and Xu, Qiang},
+  journal={arXiv preprint arXiv:2503.05639},
+  year={2025}
}
```
