StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
Abstract
An unsupervised method learns a compact state representation using a lightweight encoder and Diffusion Transformer decoder, improving robotic performance and enabling latent action decoding from static images.
A fundamental challenge in embodied intelligence is developing state representations that are both expressive and compact for efficient world modeling and decision making. Existing methods often fail to strike this balance, yielding representations that are either overly redundant or missing task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on the decoder's strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from a compact State representation, which is encoded from static images, challenging the prevalent reliance of latent action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
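To make the core idea concrete, here is a minimal sketch of the two-token state encoding and the latent action obtained as the difference between states of consecutive observations, as described in the abstract. Everything below is illustrative only: the convolutional backbone, token dimension, and names such as `LightweightStateEncoder` and `latent_action` are placeholders rather than the authors' implementation, and the pre-trained DiT decoder and the decoding of latent actions into executable robot actions are omitted.

```python
# Hedged sketch: a lightweight encoder compressing a static image into two state
# tokens, plus a latent action computed as the difference between two states.
# All module choices and sizes are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class LightweightStateEncoder(nn.Module):
    """Encodes a static image into a highly compressed two-token state."""

    def __init__(self, token_dim: int = 256, num_tokens: int = 2):
        super().__init__()
        # Small convolutional backbone (placeholder for the paper's lightweight encoder).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_tokens = nn.Linear(128, num_tokens * token_dim)
        self.num_tokens = num_tokens
        self.token_dim = token_dim

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> state tokens: (B, num_tokens, token_dim)
        feats = self.backbone(image).flatten(1)
        return self.to_tokens(feats).view(-1, self.num_tokens, self.token_dim)


def latent_action(state_t: torch.Tensor, state_t1: torch.Tensor) -> torch.Tensor:
    """Latent action as the difference between consecutive compact states.

    The abstract reports that this difference (via latent interpolation) behaves
    as an effective latent action; here we simply subtract token-wise.
    """
    return state_t1 - state_t


if __name__ == "__main__":
    encoder = LightweightStateEncoder()
    frame_t = torch.randn(1, 3, 224, 224)   # observation at time t
    frame_t1 = torch.randn(1, 3, 224, 224)  # observation at time t+1
    z_t, z_t1 = encoder(frame_t), encoder(frame_t1)
    a_latent = latent_action(z_t, z_t1)      # shape: (1, 2, 256)
    print(a_latent.shape)
```

In the paper's pipeline this latent action would additionally be decoded into executable robot actions and used for policy co-training; that stage is not shown here.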
Community
StaMo
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation (2025)
- RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation (2025)
- ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context (2025)
- VGGT-DP: Generalizable Robot Control via Vision Foundation Models (2025)
- Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA (2025)
- Latent Action Pretraining Through World Modeling (2025)
- Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance (2025)