Hooman committed on
Commit f863379 · verified · 1 Parent(s): 20c2285

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +107 -5
README.md CHANGED
@@ -1,5 +1,107 @@
- ---
- license: other
- license_name: autodesk-non-commercial-3d-generative-v1.0
- license_link: LICENSE
- ---
+ ---
+ language:
+ - en
+ license: other
+ license_name: autodesk-non-commercial-3d-generative-v1.0
+ tags:
+ - wala
+ - text-to-depthmap
+ ---
+
+ # Model Card for WaLa-MVDream-DM6
+
+ This model was introduced in the Wavelet Latent Diffusion (WaLa) paper. It generates six-view depth maps from text descriptions to support text-to-3D generation.
+
+ ## Model Details
+
+ ### Model Description
+
+ WaLa-MVDream-DM6 is a fine-tuned version of the MVDream model, adapted to generate six-view depth maps from text inputs. It serves as an intermediate step in WaLa's text-to-3D generation pipeline, producing multi-view depth maps that the WaLa-DM6-1B model then turns into 3D shapes.
+
+ - **Developed by:** Aditya Sanghi, Aliasghar Khani, Chinthala Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
+ - **Model type:** Text-to-Depth Map Generative Model
+ - **License:** Autodesk Non-Commercial (3D Generative) v1.0
+
+ For more information, please see the [project page](TBD) and [the paper](TBD).
+
+ ### Model Sources
+
+ - **Repository:** [GitHub](https://github.com/AutodeskAILab/WaLa)
+ - **Paper:** [arXiv:TBD](TBD)
+ - **Demo:** [TBD](TBD)
+
+ ## Uses
+
+ ### Direct Use
+
+ This model is released by Autodesk for academic and research purposes only, for theoretical exploration and demonstration of the WaLa 3D generative framework. It is designed to be used in conjunction with WaLa-DM6-1B for text-to-3D generation. Please see [here](TBD) for inference instructions.
+
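+ The intended two-stage flow is sketched below; the function names are placeholders standing in for the actual entry points in the WaLa GitHub repository, not a published API.
+
+ ```python
+ # Hypothetical sketch of WaLa's two-stage text-to-3D flow. Neither
+ # function is a real WaLa API; both stand in for the entry points
+ # provided by the WaLa GitHub repository.
+
+ def generate_depth_maps(prompt: str) -> list:
+     """Stage 1 (this model): text -> six-view depth maps."""
+     raise NotImplementedError("Use the WaLa repository's inference script.")
+
+ def generate_3d_shape(depth_maps: list):
+     """Stage 2 (WaLa-DM6-1B): six depth maps -> a 3D shape."""
+     raise NotImplementedError("Use the WaLa repository's inference script.")
+
+ # depth_maps = generate_depth_maps("a wooden rocking chair")
+ # shape = generate_3d_shape(depth_maps)
+ ```
+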
+ ### Out-of-Scope Use
+
+ The model should not be used for:
+
+ - Commercial purposes
+ - Generation of inappropriate or offensive content
+ - Any usage not in compliance with the [license](https://huggingface.co/ADSKAILab/WaLa-MVDream-DM6/blob/main/LICENSE.md), in particular its "Acceptable Use" section
+
+ ## Bias, Risks, and Limitations
+
+ ### Bias
+
+ - The model may inherit biases present in the text-image datasets used for pre-training and fine-tuning.
+ - The model's performance may vary depending on the complexity and specificity of the input text descriptions.
+
+ ### Risks and Limitations
+
+ - The quality of the generated multi-view depth maps may impact the subsequent 3D shape generation.
+ - The model may occasionally generate depth maps that do not accurately represent the input text or maintain consistency across views.
+
+ ## How to Get Started with the Model
+
+ Please refer to the instructions [here](TBD).
+
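+ Until those instructions are published, the checkpoint files themselves can be fetched from the Hub; the snippet below is a minimal sketch using the standard `huggingface_hub` API (the repository id is taken from this card's license link). Running inference additionally requires the WaLa GitHub repository.
+
+ ```python
+ # Minimal sketch: download this checkpoint's files locally.
+ # Inference itself is driven by the WaLa GitHub repository.
+ from huggingface_hub import snapshot_download
+
+ local_dir = snapshot_download(repo_id="ADSKAILab/WaLa-MVDream-DM6")
+ print(f"Checkpoint downloaded to: {local_dir}")
+ ```
+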
+ ## Training Details
+
+ ### Training Data
+
+ The model was fine-tuned on captions generated for the WaLa dataset. Captions were initially created using the InternVL 2.0 model and then augmented using LLaMA 3.1 to enhance their diversity and richness.
+
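+ For illustration only, caption augmentation with an instruction-tuned LLM might look like the sketch below; the model id and prompt are assumptions, not the paper's exact setup.
+
+ ```python
+ # Illustrative sketch of LLM-based caption augmentation.
+ # Model id and prompt are assumptions, not the paper's setup.
+ from transformers import pipeline
+
+ generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
+
+ caption = "A small wooden chair with four legs."
+ prompt = f"Rewrite this 3D-object caption with richer visual detail: {caption}"
+ augmented = generator(prompt, max_new_tokens=60)[0]["generated_text"]
+ print(augmented)
+ ```
+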
+ ### Training Procedure
+
+ #### Preprocessing
+
+ Captions were generated for each 3D object in the dataset using four renderings and two distinct prompts, then augmented to increase diversity. For depth-map generation, six views were used to ensure comprehensive coverage of the entire object.
+
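+ For intuition, six axis-aligned views (one per face of a bounding cube) cover an object from all sides; whether the paper uses exactly these directions is an assumption in the sketch below.
+
+ ```python
+ # Illustrative only: six axis-aligned camera positions around an object.
+ # The paper's actual camera setup may differ.
+ import numpy as np
+
+ view_dirs = np.array([
+     [ 1, 0, 0], [-1, 0, 0],   # right / left
+     [ 0, 1, 0], [ 0, -1, 0],  # top / bottom
+     [ 0, 0, 1], [ 0, 0, -1],  # front / back
+ ], dtype=float)
+
+ radius = 2.0  # distance of each camera from the object's center
+ camera_positions = radius * view_dirs
+ print(camera_positions)
+ ```
+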
+ #### Training Hyperparameters
+
+ - **Training regime:** Please refer to the paper.
+
+ #### Speeds, Sizes, Times
+
+ [Information not provided in the paper]
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ [Specific evaluation details for this model are not provided in the paper]
+
+ ### Results
+
+ [Specific results for this model are not provided in the paper]
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ The model is based on the MVDream architecture, fine-tuned to generate six-view depth maps from text inputs, and is designed to work in tandem with the WaLa-DM6-1B model for text-to-3D generation. It uses the Stable Diffusion framework, is initialized with weights from MVDream, and is fine-tuned on paired depth-map and text data.
+
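+ Since the model uses the Stable Diffusion framework, its training objective is presumably the standard denoising-diffusion loss; the sketch below shows that objective in simplified form (latent encoding, noise-schedule details, and multi-view attention are omitted, and the function signature is an assumption).
+
+ ```python
+ # Simplified sketch of the standard denoising-diffusion objective
+ # used by Stable Diffusion-style models; WaLa-specific details omitted.
+ import torch
+ import torch.nn.functional as F
+
+ def diffusion_loss(model, x0, text_emb, alphas_cumprod):
+     """One training step: the model predicts the noise added to x0."""
+     b = x0.shape[0]
+     t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
+     noise = torch.randn_like(x0)
+     a = alphas_cumprod[t].view(b, 1, 1, 1)
+     x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # noised depth maps
+     pred = model(x_t, t, text_emb)                # predicted noise
+     return F.mse_loss(pred, noise)
+ ```
+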
+ ### Compute Infrastructure
+
+ #### Hardware
+
+ [TBD]
+
+ ## Citation
+
+ [Citation information to be added after paper publication]