Collections
Collections including paper arxiv:2508.07917

- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
  Paper • 2508.07407 • Published • 84
- A Survey on Diffusion Language Models
  Paper • 2508.10875 • Published • 33
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
  Paper • 2508.06471 • Published • 161
- Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
  Paper • 2508.09968 • Published • 15

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 37
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 33
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 40

- Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
  Paper • 2506.06205 • Published • 29
- BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
  Paper • 2506.07530 • Published • 20
- Ark: An Open-source Python-based Framework for Robot Learning
  Paper • 2506.21628 • Published • 16
- RoboBrain 2.0 Technical Report
  Paper • 2507.02029 • Published • 30

- Gemini Robotics: Bringing AI into the Physical World
  Paper • 2503.20020 • Published • 28
- Magma: A Foundation Model for Multimodal AI Agents
  Paper • 2502.13130 • Published • 58
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 51

- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 40

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 37
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
  Paper • 2507.04447 • Published • 43
- A Survey on Vision-Language-Action Models for Autonomous Driving
  Paper • 2506.24044 • Published • 14
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
  Paper • 2507.10548 • Published • 36

- Unified Vision-Language-Action Model
  Paper • 2506.19850 • Published • 27
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  Paper • 2506.01844 • Published • 128
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
  Paper • 2403.09631 • Published • 10
- QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
  Paper • 2312.14457 • Published • 1

- Describe Anything: Detailed Localized Image and Video Captioning
  Paper • 2504.16072 • Published • 63
- EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment
  Paper • 2410.09604 • Published
- Geospatial Mechanistic Interpretability of Large Language Models
  Paper • 2505.03368 • Published • 10
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
  Paper • 2505.02836 • Published • 7

- Cosmos World Foundation Model Platform for Physical AI
  Paper • 2501.03575 • Published • 81
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
  Paper • 2501.00599 • Published • 48
- OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
  Paper • 2501.03841 • Published • 56
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
  Paper • 2501.04003 • Published • 28
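
A listing like the one above can also be retrieved programmatically rather than scraped from the page. Below is a minimal sketch using the huggingface_hub Python library, assuming a version (>= 0.19) where list_collections() accepts an item filter of the form "paper/<arxiv-id>"; the sort and limit parameters shown follow the Hub client's documented options, and the printed attributes (title, slug) are the lightweight fields returned for each collection.

```python
# Minimal sketch: list Hub collections that include paper arxiv:2508.07917.
# Assumes huggingface_hub >= 0.19, where list_collections() supports an
# `item` filter such as "paper/2508.07917".
from huggingface_hub import list_collections

for collection in list_collections(item="paper/2508.07917", sort="upvotes", limit=20):
    # Each result is a lightweight Collection object; fetch full item lists
    # separately with get_collection(collection.slug) if needed.
    print(f"{collection.title} ({collection.slug})")
```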