arXiv:2510.05684

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Published on Oct 7 · Submitted by Jaeyoon Jung on Oct 13
#2 Paper of the day

Abstract

AI-generated summary: The D2E framework uses desktop interactions to pretrain embodied AI, achieving high success rates in physical manipulation and navigation tasks.

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve a 96.6% overall success rate on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all of our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/.

Community

Paper author · Paper submitter

We present D2E 🎮→🤖, a framework that scales Vision-Action Pretraining on desktop interaction data to accelerate Embodied AI 🚀.
By turning ordinary game and desktop interactions into training fuel, D2E builds rich visuomotor priors that transfer from screens to robots.

✨ OWA Toolkit 🖥️: a unified recorder + storage format for multi-modal desktop data (screen, keyboard, mouse).
OWA compresses raw gameplay into the compact OWAMcap format, achieving 152× storage efficiency while preserving temporal precision ⚡.
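To make the event-stream idea concrete, here is a minimal sketch of what a standardized desktop interaction record could look like. The field names and layout are illustrative assumptions, not the actual OWAMcap schema shipped with the OWA Toolkit.

```python
# Illustrative only: field names/structure are assumptions, not the real OWAMcap schema.
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

@dataclass
class DesktopEvent:
    """One timestamped desktop interaction event."""
    t_ns: int                                 # event timestamp in nanoseconds
    kind: Literal["screen", "keyboard", "mouse"]
    frame_ref: Optional[str] = None           # pointer into a compressed video track
    key: Optional[str] = None                 # e.g. "w", "space"
    pressed: Optional[bool] = None            # key/button down vs. up
    cursor: Optional[Tuple[int, int]] = None  # mouse (x, y) in screen pixels

# Screen frames reference a compressed video stream instead of storing raw pixels
# per event (the kind of design that yields large storage savings), while
# keyboard/mouse events remain lightweight metadata rows.
recording = [
    DesktopEvent(t_ns=0,          kind="screen",   frame_ref="gameplay.mkv#frame=0"),
    DesktopEvent(t_ns=4_000_000,  kind="keyboard", key="w", pressed=True),
    DesktopEvent(t_ns=16_000_000, kind="mouse",    cursor=(960, 540)),
]
```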

🧠 Generalist-IDM: a universal inverse dynamics model that performs timestamp-based next-event prediction ⏱️.
Trained on 259 h of human-recorded data across 20 games 🎮📊, it generalizes zero-shot to unseen games and enables pseudo-labeling of 1,055 h of YouTube gameplay, expanding the training data far beyond the human recordings.
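As a rough illustration of the timestamp-based next-event objective, the sketch below serializes an input-event stream into (context, target) pairs. It shows only the event-token side; the actual model also conditions on screen frames, and this token format is an assumption rather than the paper's exact scheme.

```python
# Hedged sketch: next-event prediction targets over a desktop input-event stream.
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in ms, event token such as "key_down:w")

def with_deltas(events: List[Event]) -> List[Tuple[int, str]]:
    """Replace absolute timestamps with inter-event time deltas."""
    return [(t - (events[i - 1][0] if i else 0), tok) for i, (t, tok) in enumerate(events)]

def to_training_pairs(events: List[Event]) -> List[Tuple[str, str]]:
    """Given the events so far, the target is the next event token plus its time offset."""
    pairs = []
    for i in range(1, len(events)):
        context = " ".join(f"<{dt}ms> {tok}" for dt, tok in with_deltas(events[:i]))
        dt_next = events[i][0] - events[i - 1][0]
        pairs.append((context, f"<{dt_next}ms> {events[i][1]}"))
    return pairs

demo = [(0, "key_down:w"), (120, "mouse_move:+5,-2"), (180, "key_up:w")]
print(to_training_pairs(demo)[-1])
# ('<0ms> key_down:w <120ms> mouse_move:+5,-2', '<60ms> key_up:w')
```

A pseudo-labeling pass would run the trained model over unlabeled gameplay video to produce such event streams, which then join the human-recorded data for pretraining.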

🔬 VAPT (Vision-Action PreTraining): pretraining a 1B-parameter InternVL3 backbone on our 1.3K-hour dataset, then transferring it to real-world robot domains 🦾.
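A minimal sketch of the backbone-plus-action-head pattern this implies follows; the modules and dimensions are toy stand-ins, not the released VAPT code, and the real backbone is a full vision-language model rather than an MLP.

```python
# Hedged sketch: a desktop-pretrained backbone reused under a task-specific action head.
import torch
import torch.nn as nn

class VAPTPolicy(nn.Module):
    def __init__(self, backbone: nn.Module, hidden: int, action_dim: int):
        super().__init__()
        self.backbone = backbone                # e.g. a ~1B-param VLM such as InternVL3
        self.action_head = nn.Linear(hidden, action_dim)

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(obs_tokens)       # fused observation features, (B, T, H)
        return self.action_head(feats[:, -1])   # predict the next action from the last step

# Stage 1: pretrain with a desktop-event head (keyboard/mouse tokens).
# Stage 2: keep the pretrained backbone, swap in a robot-action head, and
#          fine-tune on manipulation/navigation demonstrations.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))  # toy stand-in
policy = VAPTPolicy(backbone, hidden=64, action_dim=7)   # e.g. a 7-DoF arm action
print(policy(torch.randn(2, 10, 32)).shape)              # torch.Size([2, 7])
```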

🤖 When transferred to embodied domains, D2E achieves a 96.6% 🔥 success rate on LIBERO manipulation and 83.3% 🔥 on CANVAS navigation, demonstrating strong generalization from desktop data to embodied tasks.

๐ŸŒ D2E demonstrates that desktop-scale learning can unlock low-cost, high-transfer embodied intelligence and enable internet-scale embodied ai pretraining, bringing the gap between the digital and physical worlds.

📄 Paper: https://arxiv.org/abs/2510.05684
💻 Project: https://worv-ai.github.io/d2e/
