Human2Robot: Learning Robot Actions from Paired Human-Robot Videos
Abstract
A diffusion-based framework, Human2Robot, effectively translates human demonstrations into robotic actions, capturing temporal dynamics and achieving high-quality video generation and generalization in various tasks.
Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing work often overlooks the differences between humans and robots, producing unsatisfactory results. In this paper, we study how perfectly aligned human-robot pairs benefit robot learning. Capitalizing on VR-based teleoperation, we introduce H&R, a third-person dataset with 2,600 episodes, each of which captures the fine-grained correspondence between the human hand and the robot gripper. Inspired by the recent success of diffusion models, we introduce Human2Robot, an end-to-end diffusion framework that formulates learning from human demonstrations as a generative task. Human2Robot fully exploits the temporal dynamics in human videos to generate robot videos and predict actions at the same time. Through comprehensive evaluations on 4 carefully selected real-world tasks, we demonstrate that Human2Robot not only generates high-quality robot videos but also excels on seen tasks and generalizes to different positions, unseen appearances, novel instances, and even new backgrounds and task types.
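To make the generative formulation concrete, below is a minimal, hypothetical sketch (not the paper's released code) of the interface such a framework implies: a diffusion denoiser conditioned on a human demonstration video that jointly denoises a robot video and an action sequence. All module names, tensor shapes, and the 7-DoF action dimension are illustrative assumptions.

```python
# Hypothetical sketch of joint robot-video-and-action diffusion conditioned on
# a human demonstration video. Shapes and architecture are assumptions, not
# the paper's actual implementation.
import torch
import torch.nn as nn


class Human2RobotDenoiser(nn.Module):
    """Predicts the noise added to (robot video, action) pairs, conditioned on
    an encoded human demonstration video and the diffusion timestep."""

    def __init__(self, frames=16, channels=3, size=32, action_dim=7, hidden=256):
        super().__init__()
        video_dim = frames * channels * size * size
        self.cond_encoder = nn.Sequential(              # encodes the human video
            nn.Linear(video_dim, hidden), nn.SiLU())
        self.time_embed = nn.Sequential(                # embeds the diffusion timestep
            nn.Linear(1, hidden), nn.SiLU())
        self.video_in = nn.Linear(video_dim, hidden)    # noisy robot video features
        self.action_in = nn.Linear(frames * action_dim, hidden)  # noisy action features
        self.video_head = nn.Linear(hidden * 3, video_dim)           # noise on robot video
        self.action_head = nn.Linear(hidden * 3, frames * action_dim)  # noise on actions

    def forward(self, noisy_robot_video, noisy_actions, human_video, t):
        b = noisy_robot_video.shape[0]
        cond = self.cond_encoder(human_video.reshape(b, -1))
        temb = self.time_embed(t.float().reshape(b, 1))
        h = torch.cat([self.video_in(noisy_robot_video.reshape(b, -1)) + temb,
                       self.action_in(noisy_actions.reshape(b, -1)),
                       cond], dim=-1)
        return (self.video_head(h).reshape_as(noisy_robot_video),
                self.action_head(h).reshape_as(noisy_actions))


# One DDPM-style training step on random data: noise the robot video and action
# sequence at a sampled timestep, then regress the model's noise prediction
# toward the true noise (standard epsilon-prediction loss).
if __name__ == "__main__":
    B, F, C, S, A = 2, 16, 3, 32, 7
    model = Human2RobotDenoiser(F, C, S, A)
    robot_video = torch.randn(B, F, C, S, S)
    actions = torch.randn(B, F, A)
    human_video = torch.randn(B, F, C, S, S)
    t = torch.randint(0, 1000, (B,))
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).view(B, 1, 1, 1, 1) ** 2
    eps_v, eps_a = torch.randn_like(robot_video), torch.randn_like(actions)
    noisy_v = alpha_bar.sqrt() * robot_video + (1 - alpha_bar).sqrt() * eps_v
    ab_a = alpha_bar.view(B, 1, 1)
    noisy_a = ab_a.sqrt() * actions + (1 - ab_a).sqrt() * eps_a
    pred_v, pred_a = model(noisy_v, noisy_a, human_video, t)
    loss = nn.functional.mse_loss(pred_v, eps_v) + nn.functional.mse_loss(pred_a, eps_a)
    print(f"joint epsilon-prediction loss: {loss.item():.4f}")
```

At inference, one would iteratively denoise from Gaussian noise, conditioned on the human video, to obtain both the generated robot video and the predicted action sequence for execution.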