DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models
Ruofan Liang*, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, Zian Wang*
* indicates equal contribution
Overview:
Description:
DiffusionRenderer is a video diffusion model that estimates geometry and material buffers from input footage and generates photorealistic images under specified lighting conditions, providing fundamental tools for image and video editing applications. The model is available for non-commercial use.
License/Terms of Use:
The model is distributed under the NVIDIA Source Code License.
Deployment Geography:
Global
Use Case:
AI research, development, and benchmarking for image/video de-lighting and relighting tasks.
Reference(s):
Project page: https://research.nvidia.com/labs/toronto-ai/DiffusionRenderer/
Model Architecture:
Architecture Type: Combination of Convolutional Neural Network (CNN) and Transformer.
Network Architecture: UNet. The model has 1.1B parameters.
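For reference, the parameter count of a loaded checkpoint can be verified with a few lines of PyTorch; a minimal sketch, assuming the model is already instantiated as a torch.nn.Module (the toy module below is only a stand-in):

```python
# Minimal sketch: counting parameters of a PyTorch model.
# The toy Linear module is a placeholder; loading the actual
# 1.1B-parameter UNet depends on the repository's entry points.
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Return the total number of parameters in a model."""
    return sum(p.numel() for p in model.parameters())

toy = torch.nn.Linear(512, 512)  # stand-in; the real model reports ~1.1e9
print(f"{count_parameters(toy):,} parameters")
```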
Input:
Input Type(s): Image, Video
Input Format(s): Red, Green, Blue (RGB); video input is provided as frames of images.
Input Parameters:
Video input is a five-dimensional tensor of shape [batch_size, num_frames, height, width, 3], where the last dimension holds the Red, Green, Blue (RGB) channels.
Image input uses the same five-dimensional layout, [batch_size, 1, height, width, 3], with the frame dimension set to 1.
Other Properties Related to Input: The default resolution is 512 × 512.
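A minimal sketch of assembling input in this layout, assuming PyTorch and Pillow; the frame file names are hypothetical:

```python
# Sketch: load RGB frames into the documented layout
# [batch_size, num_frames, height, width, 3], values in [0, 1].
import numpy as np
import torch
from PIL import Image

def load_video_tensor(frame_paths, size=(512, 512)) -> torch.Tensor:
    """Load RGB frames, resize to the default 512 x 512, and stack them."""
    frames = []
    for path in frame_paths:
        img = Image.open(path).convert("RGB").resize(size)
        frames.append(np.asarray(img, dtype=np.float32) / 255.0)
    video = torch.from_numpy(np.stack(frames))  # [num_frames, H, W, 3]
    return video.unsqueeze(0)                   # [1, num_frames, H, W, 3]

# A single image uses the same layout with num_frames = 1.
image = load_video_tensor(["frame_0000.png"])   # hypothetical file name
assert image.shape == (1, 1, 512, 512, 3)
```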
Output:
Output Type(s): Image, Video
Output Format(s): Red, Green, Blue (RGB); output video is saved as image frames.
Output Parameters:
Video output is a five-dimensional tensor of shape [batch_size, num_frames, height, width, 3], where the last dimension holds the Red, Green, Blue (RGB) channels.
Image output uses the same five-dimensional layout, [batch_size, 1, height, width, 3], with the frame dimension set to 1.
Other Properties Related to Output: The default resolution is 512 × 512.
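Conversely, a minimal sketch of saving output in this layout as image frames; the random tensor below is only a stand-in for an actual model result:

```python
# Sketch: write a [batch, num_frames, H, W, 3] tensor to disk as PNG frames.
import numpy as np
import torch
from PIL import Image

output = torch.rand(1, 24, 512, 512, 3)  # stand-in for model output in [0, 1]

for t in range(output.shape[1]):
    frame = (output[0, t].clamp(0, 1).numpy() * 255).astype(np.uint8)
    Image.fromarray(frame).save(f"frame_{t:04d}.png")
```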
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
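A minimal sketch of the usual PyTorch device-selection pattern, with a stand-in module in place of the actual model:

```python
# Sketch: run inference on an NVIDIA GPU when available, else CPU.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(512, 512).to(device)  # stand-in for the real model
with torch.inference_mode():
    x = torch.rand(1, 512, device=device)
    y = model(x)
print(f"Ran on {device}")
```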
Software Integration:
Runtime Engine(s): Not applicable; uses Python scripts and PyTorch.
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere (A100, A5000)
Supported Operating System(s): Linux
Model Version(s):
- diffusion_renderer-inverse-svd: estimates geometry and material buffers from an input image or video.
- diffusion_renderer-forward-svd: generates photorealistic images or videos from G-buffers and an environment map.
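A hypothetical sketch of how the two checkpoints compose into an inverse-then-forward pipeline; the function names, signatures, and shapes are illustrative assumptions, not the repository's actual API:

```python
# Sketch of the two-stage pipeline: inverse rendering to G-buffers,
# then forward rendering under a new environment map. Both functions
# are placeholders returning random tensors of plausible shapes.
import torch

def run_inverse_renderer(video: torch.Tensor) -> dict:
    """Placeholder for diffusion_renderer-inverse-svd: returns per-frame
    G-buffers (base color, roughness, metallic, normals, depth)."""
    b, t, h, w, _ = video.shape
    return {k: torch.rand(b, t, h, w, c) for k, c in
            [("base_color", 3), ("roughness", 1), ("metallic", 1),
             ("normals", 3), ("depth", 1)]}

def run_forward_renderer(gbuffers: dict, env_map: torch.Tensor) -> torch.Tensor:
    """Placeholder for diffusion_renderer-forward-svd: returns a relit
    RGB video of shape [batch, num_frames, height, width, 3]."""
    b, t, h, w, _ = gbuffers["base_color"].shape
    return torch.rand(b, t, h, w, 3)

video = torch.rand(1, 24, 512, 512, 3)           # input RGB video in [0, 1]
env_map = torch.rand(1, 256, 512, 3)             # HDR environment map (lat-long)
gbuffers = run_inverse_renderer(video)           # stage 1: inverse rendering
relit = run_forward_renderer(gbuffers, env_map)  # stage 2: forward rendering
```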
Training Dataset:
The training data consists of paired samples of RGB video frames, G-buffers (base color, roughness, metallic, normals, depth), and lighting (HDR environment maps).
Data Collection Method: Synthetic
Labeling Method: Hybrid (synthetic, automated)
Properties:
- 150k synthetic videos, 24 frames each at 512×512 resolution.
- 150k auto-labeled real videos, 24 frames each at 512×960 resolution.
For additional details, please refer to the paper.
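As a rough illustration, one paired training sample as described above could be organized as follows; the key names and shapes are assumptions, not the dataset's actual schema:

```python
# Sketch of one paired training sample (24 frames at 512x512),
# with random tensors standing in for real data.
import torch

sample = {
    "rgb":        torch.rand(24, 512, 512, 3),  # RGB video frames in [0, 1]
    "base_color": torch.rand(24, 512, 512, 3),
    "roughness":  torch.rand(24, 512, 512, 1),
    "metallic":   torch.rand(24, 512, 512, 1),
    "normals":    torch.rand(24, 512, 512, 3),
    "depth":      torch.rand(24, 512, 512, 1),
    "env_map":    torch.rand(256, 512, 3),      # HDR environment map (lat-long)
}
```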
Inference:
Engine: PyTorch
Test Hardware:
A100 and A5000 GPUs
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.