Enhance model card for Look, Focus, Act robot learning model #1
by nielsr (HF Staff), opened

README.md CHANGED
(Previous version: YAML front matter with only the `model_hub_mixin` and `pytorch_model_hub_mixin` tags, plus a few lines removed in this change.)
---
license: apache-2.0
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- lerobot
pipeline_tag: robotics
library_name: lerobot
---

# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

This repository contains the `gaze_model_av_aloha_sim_cube_transfer` model, a pre-trained gaze model for the AV-ALOHA simulation environment. It is part of the research presented in the paper [Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://huggingface.co/papers/2507.15833).

This work proposes a human-inspired *foveated vision framework* for robot learning that combines human gaze, foveated Vision Transformers (ViTs), and robotic control to enable policies that are both efficient and robust.

[📚 Paper](https://huggingface.co/papers/2507.15833) | [🌐 Project Website](https://ian-chuang.github.io/gaze-av-aloha/) | [💻 Code](https://github.com/ian-chuang/gaze-av-aloha)

<p align="center">
  <img src="https://github.com/ian-chuang/gaze-av-aloha/raw/main/media/hero.gif" alt="Hero GIF" width="100%">
</p>
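
To check what the repository ships, the checkpoint can be pulled and inspected directly from the Hub. The snippet below is a minimal sketch: the `model.safetensors` filename is assumed from the usual `PyTorchModelHubMixin` layout implied by the tags above, not something this card documents explicitly.

```python
# Minimal checkpoint inspection sketch. The "model.safetensors" filename is
# assumed from the standard PyTorchModelHubMixin layout; adjust it if the
# repository stores its weights under a different name.
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

weights_path = hf_hub_download(
    repo_id="iantc104/gaze_model_av_aloha_sim_cube_transfer",
    filename="model.safetensors",
)
state_dict = load_file(weights_path)
print(f"{len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name}: {tuple(tensor.shape)}")
```
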
## Abstract

Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator, as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation, without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high-precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.
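
To make the token-count claim concrete, here is a small illustrative sketch, not the paper's actual tokenizer: fine patches are taken only inside a square fovea crop centred on the gaze point, coarse patches cover the whole frame, and the result is compared against uniform fine patchification. The image, patch, and fovea sizes are arbitrary example values.

```python
# Illustrative sketch of foveated patch tokenization (not the paper's exact
# scheme): fine patches only inside a gaze-centred fovea crop, coarse patches
# over the full frame. All sizes below are arbitrary example values.
import torch
import torch.nn.functional as F

def foveated_patches(img, gaze_xy, fine=8, coarse=32, fovea=96):
    """img: (3, H, W) tensor; gaze_xy: (x, y) gaze point in pixels."""
    _, H, W = img.shape
    # Coarse "periphery" tokens covering the whole image.
    periphery = F.unfold(img[None], kernel_size=coarse, stride=coarse)
    # Fine tokens inside a fovea crop clamped to the image bounds.
    x0 = int(min(max(gaze_xy[0] - fovea // 2, 0), W - fovea))
    y0 = int(min(max(gaze_xy[1] - fovea // 2, 0), H - fovea))
    crop = img[:, y0:y0 + fovea, x0:x0 + fovea]
    fov = F.unfold(crop[None], kernel_size=fine, stride=fine)
    # In a real ViT each stream would get its own linear patch embedding.
    return periphery.squeeze(0).T, fov.squeeze(0).T

img = torch.rand(3, 240, 320)
periph_tokens, fovea_tokens = foveated_patches(img, gaze_xy=(200, 120))
uniform_tokens = (240 // 8) * (320 // 8)
print(f"uniform: {uniform_tokens} tokens, "
      f"foveated: {periph_tokens.shape[0] + fovea_tokens.shape[0]} tokens")
```

Even in this toy setting the sequence shrinks from 1200 uniform tokens to 214 foveated ones, which is the kind of reduction the abstract describes; the paper's scheme differs in its details.
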
## Installation

To work with this model within the `gaze-av-aloha` framework, you need to set up the environment by following the instructions from the official GitHub repository:

```bash
# Clone the repository and initialize submodules
git clone https://github.com/ian-chuang/gaze-av-aloha.git
cd gaze-av-aloha
git submodule init
git submodule update

# Create and activate a new Conda environment
conda create -n gaze python=3.10
conda activate gaze

# Install LeRobot
pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46

# Install FFmpeg for video logging
conda install ffmpeg=7.1.1 -c conda-forge

# Install AV-ALOHA packages
pip install -e ./gym_av_aloha
pip install -e ./gaze_av_aloha
```
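
A quick sanity check after installation can confirm that the environment resolves. The import names below are assumed to match the editable-install directories above; adjust them if the packages expose different module names.

```python
# Environment sanity check; module names are assumed from the editable
# installs above (./gym_av_aloha and ./gaze_av_aloha).
import torch
import gym_av_aloha   # AV-ALOHA simulation environments
import gaze_av_aloha  # policies, gaze models, and training scripts

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
```
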
## Sample Usage

This model (`iantc104/gaze_model_av_aloha_sim_cube_transfer`) is a pre-trained gaze prediction model designed to be plugged into the `gaze-av-aloha` framework's policy training scripts. For example, to train a robot policy that uses this gaze model (the Fov-UNet approach) on the `cube_transfer` task, run a command similar to the following:

```bash
python gaze_av_aloha/scripts/train.py \
    policy=foveated_vit_policy \
    task=av_aloha_sim_cube_transfer \
    policy.use_gaze_as_action=false \
    policy.gaze_model_repo_id=iantc104/gaze_model_av_aloha_sim_cube_transfer \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<your_project_name> \
    wandb.entity=<your_wandb_entity> \
    wandb.job_name=fov-unet \
    device=cuda
```
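
Optionally, the two Hub repositories referenced by the command above can be fetched ahead of time with `huggingface_hub`, so that later downloads of the same files are served from the local cache. This is a convenience sketch, not a step the training script requires.

```python
# Optional: pre-populate the local Hugging Face cache with the gaze model and
# the foveated ViT encoder used by the training command above.
from huggingface_hub import snapshot_download

for repo_id in (
    "iantc104/gaze_model_av_aloha_sim_cube_transfer",
    "iantc104/mae_vitb_foveated_vit",
):
    local_path = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_path}")
```
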
For more detailed usage, alternative policy training methods (like Fov-Act), and other available tasks, please refer to the comprehensive documentation and scripts in the official [gaze-av-aloha GitHub repository](https://github.com/ian-chuang/gaze-av-aloha).

## Citation

If you find this work useful for your research, please consider citing the original paper:

```bibtex
@misc{chuang2025lookfocusactefficient,
      title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
      author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
      year={2025},
      eprint={2507.15833},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.15833},
}
```