AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use Paper • 2505.12650 • Published 26 days ago • 7
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Paper • 2501.13926 • Published Jan 23 • 42
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model Paper • 2501.15830 • Published Jan 27 • 14
Exploring the Potential of Encoder-free Architectures in 3D LMMs Paper • 2502.09620 • Published Feb 13 • 26
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions Paper • 2409.15278 • Published Sep 23, 2024 • 26
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners Paper • 2408.16768 • Published Aug 29, 2024 • 29
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21, 2024 • 54
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models Paper • 2311.07575 • Published Nov 13, 2023 • 15