D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Abstract
The D2E framework uses desktop interactions to pretrain embodied AI, achieving high success rates on physical manipulation and navigation tasks.
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework demonstrating that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve an overall success rate of 96.6% on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/
Community
We present D2E 🎮→🤖, a framework that scales Vision-Action Pretraining on desktop interaction data to accelerate Embodied AI.
By turning ordinary game and desktop interactions into training fuel, D2E builds rich visuomotor priors that transfer from screens to robots.
✨ OWA Toolkit 🖥️ – a unified recorder + storage format for multi-modal desktop data (screen, keyboard, mouse).
OWA compresses raw gameplay into the compact OWAMcap format, achieving 152× storage efficiency while preserving temporal precision ⚡.
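To make the idea concrete, here is a minimal sketch of logging screen, keyboard, and mouse events on a single timestamped timeline and compressing the result. The `DesktopEvent` schema, gzip-JSONL storage, and `EventLogger` class are illustrative assumptions, not the OWA Toolkit API; the real toolkit writes OWAMcap files and reaches its reported 152× compression presumably in large part by keeping the screen stream as encoded video rather than raw frames.

```python
# Minimal sketch (not the OWA Toolkit API): unify screen/keyboard/mouse events
# on one nanosecond timeline and persist them as a gzip-compressed JSONL log.
import gzip
import json
import time
from dataclasses import dataclass, asdict
from typing import Any, Dict, List

@dataclass
class DesktopEvent:
    timestamp_ns: int        # shared timeline across all modalities
    topic: str               # "screen", "keyboard", or "mouse"
    payload: Dict[str, Any]  # modality-specific fields

class EventLogger:
    """Accumulates timestamped desktop events and flushes them to disk."""

    def __init__(self) -> None:
        self.events: List[DesktopEvent] = []

    def log(self, topic: str, payload: Dict[str, Any]) -> None:
        self.events.append(DesktopEvent(time.time_ns(), topic, payload))

    def flush(self, path: str) -> None:
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for event in self.events:
                f.write(json.dumps(asdict(event)) + "\n")

# Usage: interleave all modalities on one timeline, then persist.
logger = EventLogger()
logger.log("keyboard", {"key": "w", "state": "down"})
logger.log("mouse", {"dx": 4, "dy": -2, "buttons": []})
logger.log("screen", {"frame_ref": "session.mkv#1042"})  # frames live in a video file
logger.flush("session.jsonl.gz")
```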
🧠 Generalist-IDM – a universal inverse dynamics model trained with timestamp-based next-event prediction, emitting next-event tokens with explicit timestamps ⏱️.
It generalizes zero-shot to unseen games and enables pseudo-labeling of 1,055 h of YouTube gameplay, extending the training data far beyond the 259 h of human-recorded demonstrations across 20 games 🎮.
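A rough sketch of how such a next-event-prediction IDM could be wired into a pseudo-labeling loop is shown below; the `GeneralistIDM` interface, the `Frame`/`Event` placeholder types, and the fixed-window slicing are assumptions for illustration, not the released model API.

```python
# Illustrative pseudo-labeling loop with a next-event-prediction IDM.
# GeneralistIDM is a stand-in interface: given a window of frames it emits
# timestamped input-event tokens, which serve as pseudo action labels.
from typing import Iterable, List, Protocol, Tuple

Frame = bytes             # an encoded video frame (placeholder type)
Event = Tuple[int, str]   # (timestamp_ns, action token), e.g. (1_000_000, "KEY_DOWN w")

class GeneralistIDM(Protocol):
    def predict_events(self, frames: List[Frame]) -> List[Event]:
        """Return predicted input events for a window of frames."""
        ...

def pseudo_label(idm: GeneralistIDM,
                 videos: Iterable[List[Frame]],
                 window: int = 16) -> Iterable[Event]:
    """Slide a fixed-size window over unlabeled gameplay videos and yield
    the IDM's predicted events as pseudo labels for downstream pretraining."""
    for frames in videos:
        for start in range(0, max(len(frames) - window + 1, 1), window):
            yield from idm.predict_events(frames[start:start + window])
```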
🔬 VAPT (Vision-Action PreTraining) – pretraining a 1B-parameter InternVL3 backbone on our 1.3K-hour dataset, then transferring it to real-world robot domains 🦾.
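The following skeleton sketches the two-stage recipe (desktop pretraining, then embodied fine-tuning). The tiny CNN backbone, MSE behavior-cloning loss, and data loaders are stand-in assumptions for the actual 1B-parameter InternVL3 model and training setup.

```python
# Two-stage VAPT-style training skeleton (illustrative, not the paper's code).
import torch
import torch.nn as nn

class VisionActionModel(nn.Module):
    """Vision backbone + action head; the backbone is a stand-in for InternVL3-1B."""

    def __init__(self, action_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.action_head = nn.Linear(32, action_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.action_head(self.backbone(frames))

def train_stage(model: nn.Module, loader, epochs: int, lr: float) -> None:
    """One supervised stage: behavior-clone the recorded or pseudo-labeled actions."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for frames, actions in loader:
            opt.zero_grad()
            loss_fn(model(frames), actions).backward()
            opt.step()

# Stage 1: pretrain on desktop data; Stage 2: fine-tune on embodied demos.
# `desktop_loader` and `robot_loader` are assumed PyTorch DataLoaders yielding
# (frames, actions) batches.
# model = VisionActionModel(action_dim=7)
# train_stage(model, desktop_loader, epochs=1, lr=1e-4)  # desktop pretraining
# train_stage(model, robot_loader, epochs=10, lr=1e-5)   # embodied transfer
```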
🤖 When transferred to embodied domains, D2E achieves a 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation, demonstrating strong generalization from desktop to real-world tasks.
D2E demonstrates that desktop-scale learning can unlock low-cost, high-transfer embodied intelligence and enable internet-scale embodied AI pretraining, bridging the gap between the digital and physical worlds.
📄 Paper: https://arxiv.org/abs/2510.05684
💻 Project: https://worv-ai.github.io/d2e/