DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models
Ruofan Liang*, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, Zian Wang*
* indicates equal contribution
Overview:
Description:
DiffusionRenderer is a video diffusion model that estimates geometry and material buffers from input footage and generates photorealistic images under specified lighting conditions, providing fundamental tools for image and video editing applications. The model is available for non-commercial use.
License/Terms of Use:
The model is distributed under the NVIDIA Source Code License.
Deployment Geography:
Global
Use Case:
AI research, development, and benchmarking for image/video de-lighting and relighting tasks.
Reference(s):
Project page: https://research.nvidia.com/labs/toronto-ai/DiffusionRenderer/
Model Architecture:
Architecture Type: Combination of Convolutional Neural Network (CNN) and Transformer.
Network Architecture: UNet. The model has 1.1B parameters.
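For reference, the parameter count of a loaded checkpoint can be verified with a few lines of PyTorch; a minimal sketch, assuming the model is already instantiated as a torch.nn.Module (the toy module below is only a stand-in):

```python
# Minimal sketch: counting parameters of a PyTorch model.
# The toy Linear module is a placeholder; loading the actual
# 1.1B-parameter UNet depends on the repository's entry points.
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Return the total number of parameters in a model."""
    return sum(p.numel() for p in model.parameters())

toy = torch.nn.Linear(512, 512)  # stand-in; the real model reports ~1.1e9
print(f"{count_parameters(toy):,} parameters")
```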
Input:
Input Type(s): Image, Video
Input Format(s): Red, Green, Blue (RGB); video input is provided as frames of images.
Input Parameters:
Video input is a five-dimensional tensor of shape [batch_size, num_frames, height, width, 3], where the last dimension holds the Red, Green, Blue (RGB) channels.
Image input uses the same five-dimensional layout, [batch_size, 1, height, width, 3], with the frame dimension set to 1.
Other Properties Related to Input: The default resolution is 512 × 512.
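A minimal sketch of assembling input in this layout, assuming PyTorch and Pillow; the frame file names are hypothetical:

```python
# Sketch: load RGB frames into the documented layout
# [batch_size, num_frames, height, width, 3], values in [0, 1].
import numpy as np
import torch
from PIL import Image

def load_video_tensor(frame_paths, size=(512, 512)) -> torch.Tensor:
    """Load RGB frames, resize to the default 512 x 512, and stack them."""
    frames = []
    for path in frame_paths:
        img = Image.open(path).convert("RGB").resize(size)
        frames.append(np.asarray(img, dtype=np.float32) / 255.0)
    video = torch.from_numpy(np.stack(frames))  # [num_frames, H, W, 3]
    return video.unsqueeze(0)                   # [1, num_frames, H, W, 3]

# A single image uses the same layout with num_frames = 1.
image = load_video_tensor(["frame_0000.png"])   # hypothetical file name
assert image.shape == (1, 1, 512, 512, 3)
```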
Output:
Output Type(s): Image, Video
Output Format(s): Red, Green, Blue (RGB); output video is saved as image frames.
Output Parameters:
Video output is a five-dimensional tensor of shape [batch_size, num_frames, height, width, 3], where the last dimension holds the Red, Green, Blue (RGB) channels.
Image output uses the same five-dimensional layout, [batch_size, 1, height, width, 3], with the frame dimension set to 1.
Other Properties Related to Output: The default resolution is 512 × 512.
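Conversely, a minimal sketch of saving output in this layout as image frames; the random tensor below is only a stand-in for an actual model result:

```python
# Sketch: write a [batch, num_frames, H, W, 3] tensor to disk as PNG frames.
import numpy as np
import torch
from PIL import Image

output = torch.rand(1, 24, 512, 512, 3)  # stand-in for model output in [0, 1]

for t in range(output.shape[1]):
    frame = (output[0, t].clamp(0, 1).numpy() * 255).astype(np.uint8)
    Image.fromarray(frame).save(f"frame_{t:04d}.png")
```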
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
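A minimal sketch of the usual PyTorch device-selection pattern, with a stand-in module in place of the actual model:

```python
# Sketch: run inference on an NVIDIA GPU when available, else CPU.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(512, 512).to(device)  # stand-in for the real model
with torch.inference_mode():
    x = torch.rand(1, 512, device=device)
    y = model(x)
print(f"Ran on {device}")
```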
Software Integration:
Runtime Engine(s): Not applicable; uses Python scripts and PyTorch.
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere (A100, A5000)
Supported Operating System(s): Linux
Model Version(s):
- diffusion_renderer-inverse-svd: estimates geometry and material buffers from an input image or video.
- diffusion_renderer-forward-svd: generates photorealistic images or videos from G-buffers and an environment map.
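A hypothetical sketch of how the two checkpoints compose into an inverse-then-forward pipeline; the function names, signatures, and shapes are illustrative assumptions, not the repository's actual API:

```python
# Sketch of the two-stage pipeline: inverse rendering to G-buffers,
# then forward rendering under a new environment map. Both functions
# are placeholders returning random tensors of plausible shapes.
import torch

def run_inverse_renderer(video: torch.Tensor) -> dict:
    """Placeholder for diffusion_renderer-inverse-svd: returns per-frame
    G-buffers (base color, roughness, metallic, normals, depth)."""
    b, t, h, w, _ = video.shape
    return {k: torch.rand(b, t, h, w, c) for k, c in
            [("base_color", 3), ("roughness", 1), ("metallic", 1),
             ("normals", 3), ("depth", 1)]}

def run_forward_renderer(gbuffers: dict, env_map: torch.Tensor) -> torch.Tensor:
    """Placeholder for diffusion_renderer-forward-svd: returns a relit
    RGB video of shape [batch, num_frames, height, width, 3]."""
    b, t, h, w, _ = gbuffers["base_color"].shape
    return torch.rand(b, t, h, w, 3)

video = torch.rand(1, 24, 512, 512, 3)           # input RGB video in [0, 1]
env_map = torch.rand(1, 256, 512, 3)             # HDR environment map (lat-long)
gbuffers = run_inverse_renderer(video)           # stage 1: inverse rendering
relit = run_forward_renderer(gbuffers, env_map)  # stage 2: forward rendering
```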
Training Dataset:
The training data consists of paired samples of RGB video frames, G-buffers (base color, roughness, metallic, normals, depth), and lighting (HDR environment maps).
Data Collection Method: Synthetic
Labeling Method: Hybrid (synthetic, automated)
Properties:
- 150k synthetic videos, 24 frames each at 512×512 resolution.
- 150k auto-labeled real videos, 24 frames each at 512×960 resolution.
For additional details, please refer to the paper.
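As a rough illustration, one paired training sample as described above could be organized as follows; the key names and shapes are assumptions, not the dataset's actual schema:

```python
# Sketch of one paired training sample (24 frames at 512x512),
# with random tensors standing in for real data.
import torch

sample = {
    "rgb":        torch.rand(24, 512, 512, 3),  # RGB video frames in [0, 1]
    "base_color": torch.rand(24, 512, 512, 3),
    "roughness":  torch.rand(24, 512, 512, 1),
    "metallic":   torch.rand(24, 512, 512, 1),
    "normals":    torch.rand(24, 512, 512, 3),
    "depth":      torch.rand(24, 512, 512, 1),
    "env_map":    torch.rand(256, 512, 3),      # HDR environment map (lat-long)
}
```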
Inference:
Engine: PyTorch
Test Hardware:
A100 and A5000 GPUs
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.