D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Abstract
The D2E framework uses desktop interactions to pretrain embodied AI, achieving high success rates on physical manipulation and navigation tasks.
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework demonstrating that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve an overall success rate of 96.6% on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/
Community
We present D2E 🎮→🤖, a framework that scales Vision-Action Pretraining on desktop interaction data to accelerate Embodied AI.
By turning ordinary game and desktop interactions into training fuel, D2E builds rich visuomotor priors that transfer from screens to robots.
✨ OWA Toolkit 🖥️ – a unified recorder + storage format for multi-modal desktop data (screen, keyboard, mouse).
OWA compresses raw gameplay into the compact OWAMcap format, achieving 152× storage efficiency while preserving temporal precision ⚡.
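To make the idea concrete, here is a minimal sketch of logging screen, keyboard, and mouse events on a single timestamped timeline and compressing the result. The `DesktopEvent` schema, gzip-JSONL storage, and `EventLogger` class are illustrative assumptions, not the OWA Toolkit API; the real toolkit writes OWAMcap files and reaches its reported 152× compression presumably in large part by keeping the screen stream as encoded video rather than raw frames.

```python
# Minimal sketch (not the OWA Toolkit API): unify screen/keyboard/mouse events
# on one nanosecond timeline and persist them as a gzip-compressed JSONL log.
import gzip
import json
import time
from dataclasses import dataclass, asdict
from typing import Any, Dict, List

@dataclass
class DesktopEvent:
    timestamp_ns: int        # shared timeline across all modalities
    topic: str               # "screen", "keyboard", or "mouse"
    payload: Dict[str, Any]  # modality-specific fields

class EventLogger:
    """Accumulates timestamped desktop events and flushes them to disk."""

    def __init__(self) -> None:
        self.events: List[DesktopEvent] = []

    def log(self, topic: str, payload: Dict[str, Any]) -> None:
        self.events.append(DesktopEvent(time.time_ns(), topic, payload))

    def flush(self, path: str) -> None:
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for event in self.events:
                f.write(json.dumps(asdict(event)) + "\n")

# Usage: interleave all modalities on one timeline, then persist.
logger = EventLogger()
logger.log("keyboard", {"key": "w", "state": "down"})
logger.log("mouse", {"dx": 4, "dy": -2, "buttons": []})
logger.log("screen", {"frame_ref": "session.mkv#1042"})  # frames live in a video file
logger.flush("session.jsonl.gz")
```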
🧠 Generalist-IDM – a universal inverse dynamics model trained with timestamp-based next-event prediction, emitting next-event tokens with explicit timestamps ⏱️.
It generalizes zero-shot to unseen games and enables pseudo-labeling of 1,055 h of YouTube gameplay, extending the training data far beyond the 259 h of human-recorded demonstrations across 20 games 🎮.
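A rough sketch of how such a next-event-prediction IDM could be wired into a pseudo-labeling loop is shown below; the `GeneralistIDM` interface, the `Frame`/`Event` placeholder types, and the fixed-window slicing are assumptions for illustration, not the released model API.

```python
# Illustrative pseudo-labeling loop with a next-event-prediction IDM.
# GeneralistIDM is a stand-in interface: given a window of frames it emits
# timestamped input-event tokens, which serve as pseudo action labels.
from typing import Iterable, List, Protocol, Tuple

Frame = bytes             # an encoded video frame (placeholder type)
Event = Tuple[int, str]   # (timestamp_ns, action token), e.g. (1_000_000, "KEY_DOWN w")

class GeneralistIDM(Protocol):
    def predict_events(self, frames: List[Frame]) -> List[Event]:
        """Return predicted input events for a window of frames."""
        ...

def pseudo_label(idm: GeneralistIDM,
                 videos: Iterable[List[Frame]],
                 window: int = 16) -> Iterable[Event]:
    """Slide a fixed-size window over unlabeled gameplay videos and yield
    the IDM's predicted events as pseudo labels for downstream pretraining."""
    for frames in videos:
        for start in range(0, max(len(frames) - window + 1, 1), window):
            yield from idm.predict_events(frames[start:start + window])
```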
🔬 VAPT (Vision-Action PreTraining) – pretraining a 1B-parameter InternVL3 backbone on our 1.3K-hour dataset, then transferring it to real-world robot domains 🦾.
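The following skeleton sketches the two-stage recipe (desktop pretraining, then embodied fine-tuning). The tiny CNN backbone, MSE behavior-cloning loss, and data loaders are stand-in assumptions for the actual 1B-parameter InternVL3 model and training setup.

```python
# Two-stage VAPT-style training skeleton (illustrative, not the paper's code).
import torch
import torch.nn as nn

class VisionActionModel(nn.Module):
    """Vision backbone + action head; the backbone is a stand-in for InternVL3-1B."""

    def __init__(self, action_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.action_head = nn.Linear(32, action_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.action_head(self.backbone(frames))

def train_stage(model: nn.Module, loader, epochs: int, lr: float) -> None:
    """One supervised stage: behavior-clone the recorded or pseudo-labeled actions."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for frames, actions in loader:
            opt.zero_grad()
            loss_fn(model(frames), actions).backward()
            opt.step()

# Stage 1: pretrain on desktop data; Stage 2: fine-tune on embodied demos.
# `desktop_loader` and `robot_loader` are assumed PyTorch DataLoaders yielding
# (frames, actions) batches.
# model = VisionActionModel(action_dim=7)
# train_stage(model, desktop_loader, epochs=1, lr=1e-4)  # desktop pretraining
# train_stage(model, robot_loader, epochs=10, lr=1e-5)   # embodied transfer
```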
🤖 When transferred to embodied domains, D2E achieves a 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation, demonstrating strong generalization from desktop to real-world tasks.
D2E demonstrates that desktop-scale learning can unlock low-cost, high-transfer embodied intelligence and enable internet-scale embodied AI pretraining, bridging the gap between the digital and physical worlds.
📄 Paper: https://arxiv.org/abs/2510.05684
💻 Project: https://worv-ai.github.io/d2e/