arxiv:2502.16587

Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

Published on Feb 23, 2025
Authors:

Abstract

Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing work often overlooks the differences between humans and robots, producing unsatisfactory results. In this paper, we study how perfectly aligned human-robot pairs benefit robot learning. Capitalizing on VR-based teleoperation, we introduce H&R, a third-person dataset with 2,600 episodes, each of which captures the fine-grained correspondence between the human hand and the robot gripper. Inspired by the recent success of diffusion models, we introduce Human2Robot, an end-to-end diffusion framework that formulates learning from human demonstration as a generative task. Human2Robot fully exploits the temporal dynamics in human videos to generate robot videos and predict actions at the same time. Through comprehensive evaluations on 4 carefully selected real-world tasks, we demonstrate that Human2Robot not only generates high-quality robot videos but also excels on seen tasks and generalizes to different positions, unseen appearances, novel instances, and even new backgrounds and task types.

AI-generated summary

A diffusion-based framework, Human2Robot, translates human demonstrations into robot actions, capturing temporal dynamics and achieving high-quality video generation and generalization across a variety of tasks.
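The paper's code is not part of this page. Purely as an illustrative sketch of the kind of model the abstract describes (a conditional denoiser that generates robot video from a human demonstration clip while also predicting an action chunk), a minimal PyTorch module might look like the following. All names, layer choices, and dimensions here are assumptions for illustration, not the authors' architecture.

# Hypothetical sketch (not the authors' code): a DDPM-style denoiser conditioned
# on a human demonstration clip that predicts the noise on a noisy robot clip and
# regresses a short action chunk from the shared features.
import torch
import torch.nn as nn

class Human2RobotSketch(nn.Module):
    def __init__(self, channels=3, hidden=64, action_dim=7, horizon=8):
        super().__init__()
        # 3D convolutions over (time, height, width) capture temporal dynamics.
        self.human_enc = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1), nn.SiLU(),
        )
        self.robot_enc = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=3, padding=1), nn.SiLU(),
        )
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden),
        )
        self.denoise_head = nn.Conv3d(2 * hidden, channels, kernel_size=3, padding=1)
        self.action_head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(2 * hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, human_clip, noisy_robot_clip, t):
        # human_clip, noisy_robot_clip: (B, C, T, H, W); t: (B, 1) diffusion timestep.
        cond = self.human_enc(human_clip)
        x = self.robot_enc(noisy_robot_clip)
        temb = self.time_mlp(t)[:, :, None, None, None]   # broadcast over T, H, W
        feat = torch.cat([cond, x + temb], dim=1)
        eps_pred = self.denoise_head(feat)                 # predicted noise on the robot clip
        actions = self.action_head(feat).view(-1, self.horizon, self.action_dim)
        return eps_pred, actions

if __name__ == "__main__":
    model = Human2RobotSketch()
    human = torch.randn(2, 3, 8, 64, 64)
    noisy_robot = torch.randn(2, 3, 8, 64, 64)
    t = torch.rand(2, 1)
    eps, acts = model(human, noisy_robot, t)   # eps: (2, 3, 8, 64, 64), acts: (2, 8, 7)

At inference, such a model would iteratively denoise the robot clip conditioned on the human video and read off the predicted action chunk; the training losses, sampler, and conditioning scheme used in the paper are not specified on this page.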


Models citing this paper 0


Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 0
