Pre-training Auto-regressive Robotic Models with 4D Representations
Abstract
Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language processing and computer vision, exhibiting remarkable generalization capabilities and highlighting the importance of pre-training. Yet efforts in robotics have struggled to achieve similar success, limited either by the need for costly robotic annotations or by the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on 3D point tracking representations obtained by lifting 2D point tracks from videos into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R transfers efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
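As a rough illustration of the lifting step described in the abstract, the sketch below back-projects 2D point tracks into per-frame 3D points using monocular depth estimates. This is not the authors' code; the function name, array shapes, and the use of a pinhole camera model with known intrinsics are assumptions.

```python
# Minimal sketch (assumptions noted above) of lifting 2D point tracks into a
# 3D-over-time ("4D") representation. Assumed inputs:
#   tracks_2d : (T, N, 2) pixel coordinates of N tracked points across T frames
#   depth     : (T, H, W) per-frame monocular depth estimates from an
#               off-the-shelf depth model (choice of model is an assumption)
#   K         : (3, 3) camera intrinsics
import numpy as np

def lift_tracks_to_4d(tracks_2d: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project 2D point tracks into camera-frame 3D points per frame,
    returning a (T, N, 3) array of point positions over time."""
    T, N, _ = tracks_2d.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    points_4d = np.zeros((T, N, 3), dtype=np.float32)
    for t in range(T):
        u = tracks_2d[t, :, 0]
        v = tracks_2d[t, :, 1]
        # Sample depth at rounded track locations (bilinear sampling would be
        # a straightforward refinement).
        rows = np.clip(v.round().astype(int), 0, depth.shape[1] - 1)
        cols = np.clip(u.round().astype(int), 0, depth.shape[2] - 1)
        z = depth[t, rows, cols]
        # Standard pinhole back-projection into the camera frame.
        points_4d[t, :, 0] = (u - cx) * z / fx
        points_4d[t, :, 1] = (v - cy) * z / fy
        points_4d[t, :, 2] = z
    return points_4d
```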
Community
This is an automated message from Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation (2025)
- SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation (2025)
- Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning (2025)
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control (2025)
- GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation (2025)
- S$^2$-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation (2025)
- RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation (2025)