Collections
Collections including paper arxiv:2508.07917

- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
  Paper • 2508.07407 • Published • 84
- A Survey on Diffusion Language Models
  Paper • 2508.10875 • Published • 33
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
  Paper • 2508.06471 • Published • 161
- Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
  Paper • 2508.09968 • Published • 15

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 37
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 33
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 40

- Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
  Paper • 2506.06205 • Published • 29
- BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
  Paper • 2506.07530 • Published • 20
- Ark: An Open-source Python-based Framework for Robot Learning
  Paper • 2506.21628 • Published • 16
- RoboBrain 2.0 Technical Report
  Paper • 2507.02029 • Published • 30

- Gemini Robotics: Bringing AI into the Physical World
  Paper • 2503.20020 • Published • 28
- Magma: A Foundation Model for Multimodal AI Agents
  Paper • 2502.13130 • Published • 58
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 51

- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 40

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 37
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
  Paper • 2507.04447 • Published • 43
- A Survey on Vision-Language-Action Models for Autonomous Driving
  Paper • 2506.24044 • Published • 14
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
  Paper • 2507.10548 • Published • 36

- Unified Vision-Language-Action Model
  Paper • 2506.19850 • Published • 27
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  Paper • 2506.01844 • Published • 128
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
  Paper • 2403.09631 • Published • 10
- QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
  Paper • 2312.14457 • Published • 1

- Describe Anything: Detailed Localized Image and Video Captioning
  Paper • 2504.16072 • Published • 63
- EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment
  Paper • 2410.09604 • Published
- Geospatial Mechanistic Interpretability of Large Language Models
  Paper • 2505.03368 • Published • 10
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
  Paper • 2505.02836 • Published • 7

- Cosmos World Foundation Model Platform for Physical AI
  Paper • 2501.03575 • Published • 81
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
  Paper • 2501.00599 • Published • 48
- OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
  Paper • 2501.03841 • Published • 56
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
  Paper • 2501.04003 • Published • 28
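
A listing like the one above can also be retrieved programmatically rather than scraped from the page. Below is a minimal sketch using the huggingface_hub Python library, assuming a version (>= 0.19) where list_collections() accepts an item filter of the form "paper/<arxiv-id>"; the sort and limit parameters shown follow the Hub client's documented options, and the printed attributes (title, slug) are the lightweight fields returned for each collection.

```python
# Minimal sketch: list Hub collections that include paper arxiv:2508.07917.
# Assumes huggingface_hub >= 0.19, where list_collections() supports an
# `item` filter such as "paper/2508.07917".
from huggingface_hub import list_collections

for collection in list_collections(item="paper/2508.07917", sort="upvotes", limit=20):
    # Each result is a lightweight Collection object; fetch full item lists
    # separately with get_collection(collection.slug) if needed.
    print(f"{collection.title} ({collection.slug})")
```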