FlexiTex: Enhancing Texture Generation with Visual Guidance Paper • 2409.12431 • Published Sep 2024 • 7
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion Paper • 2409.12957 • Published Sep 2024 • 15
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning Paper • 2409.12183 • Published Sep 2024 • 26
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion Paper • 2409.11406 • Published Sep 2024 • 21
INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding Paper • 2409.06210 • Published Sep 2024 • 24
Evaluating Multiview Object Consistency in Humans and Image Models Paper • 2409.05862 • Published Sep 2024 • 8
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct Paper • 2409.05840 • Published Sep 2024 • 43
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation Paper • 2409.03718 • Published Sep 2024 • 25
Compositional 3D-aware Video Generation with LLM Director Paper • 2409.00558 • Published Sep 2024 • 14
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 109
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 2024 • 81
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model Paper • 2408.16767 • Published Aug 2024 • 29
Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation Paper • 2408.15239 • Published Aug 2024 • 27
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher Paper • 2408.14176 • Published Aug 2024 • 58
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation Paper • 2408.13252 • Published Aug 2024 • 23
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation Paper • 2408.12528 • Published Aug 22, 2024 • 50
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning Paper • 2408.11001 • Published Aug 20, 2024 • 11
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published Aug 20, 2024 • 54
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model Paper • 2408.10198 • Published Aug 19, 2024 • 32
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16, 2024 • 96
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling Paper • 2408.04810 • Published Aug 9, 2024 • 22
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models Paper • 2408.04594 • Published Aug 8, 2024 • 14
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics Paper • 2408.04631 • Published Aug 8, 2024 • 8
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion Paper • 2408.03178 • Published Aug 6, 2024 • 36
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization Paper • 2408.02555 • Published Aug 5, 2024 • 28
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling Paper • 2408.01291 • Published Aug 2, 2024 • 11
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement Paper • 2408.00653 • Published Aug 1, 2024 • 27
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention Paper • 2407.19918 • Published Jul 29, 2024 • 47
Theia: Distilling Diverse Vision Foundation Models for Robot Learning Paper • 2407.20179 • Published Jul 29, 2024 • 45
SHIC: Shape-Image Correspondences with no Keypoint Supervision Paper • 2407.18907 • Published Jul 26, 2024 • 38
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency Paper • 2407.17470 • Published Jul 24, 2024 • 14
HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions Paper • 2407.15187 • Published Jul 21, 2024 • 10
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation Paper • 2407.14505 • Published Jul 19, 2024 • 24
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control Paper • 2407.12781 • Published Jul 17, 2024 • 12
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17, 2024 • 33
DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation Paper • 2407.11394 • Published Jul 16, 2024 • 11
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? Paper • 2407.04842 • Published Jul 5, 2024 • 52
VIMI: Grounding Video Generation through Multi-modal Instruction Paper • 2407.06304 • Published Jul 8, 2024 • 9
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing Paper • 2407.08770 • Published Jul 11, 2024 • 19
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 55
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published Jul 2, 2024 • 49
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10, 2024 • 40