nielsr (HF Staff) committed · verified
Commit 42ff78f · 1 Parent(s): b66a487

Enhance model card for Look, Focus, Act robot learning model


This PR significantly improves the model card for the `gaze_model_av_aloha_sim_cube_transfer` model by:

- Adding the `apache-2.0` license.
- Setting the `pipeline_tag` to `robotics` for better discoverability.
- Specifying `lerobot` as the `library_name`, reflecting its integration with the LeRobot framework.
- Including direct links to the associated Hugging Face paper, project website, and the official GitHub repository.
- Incorporating the paper's abstract and a hero GIF for comprehensive context.
- Providing detailed installation instructions and a sample usage command that illustrates how this gaze model is used within the `gaze-av-aloha` framework for training robot policies, aligning with the project's primary use case.

This update makes the model card much more informative and user-friendly for researchers and practitioners interested in robot learning with foveated vision.

Files changed (1): README.md (+81, -4)
README.md CHANGED
@@ -1,10 +1,87 @@
  ---
+ license: apache-2.0
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
+ - lerobot
+ pipeline_tag: robotics
+ library_name: lerobot
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Code: [More Information Needed]
- - Paper: [More Information Needed]
- - Docs: [More Information Needed]
+ # Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
+
+ This repository contains the `gaze_model_av_aloha_sim_cube_transfer` model, a pre-trained gaze model for the AV-ALOHA simulation environment. It is part of the research presented in the paper [Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://huggingface.co/papers/2507.15833).
+
+ This work proposes a human-inspired *foveated vision framework* for robot learning that combines human gaze, foveated Vision Transformers (ViTs), and robotic control to enable policies that are both efficient and robust.
+
+ [📚 Paper](https://huggingface.co/papers/2507.15833) | [🌐 Project Website](https://ian-chuang.github.io/gaze-av-aloha/) | [💻 Code](https://github.com/ian-chuang/gaze-av-aloha)
+
+ <p align="center">
+ <img src="https://github.com/ian-chuang/gaze-av-aloha/raw/main/media/hero.gif" alt="Hero GIF" width="100%">
+ </p>
+
+ ## Abstract
+ Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation, without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.
+
+ ## Installation
+
+ To work with this model within the `gaze-av-aloha` framework, you need to set up the environment by following the instructions from the official GitHub repository:
+
+ ```bash
+ # Clone the repository and initialize submodules
+ git clone https://github.com/ian-chuang/gaze-av-aloha.git
+ cd gaze-av-aloha
+ git submodule init
+ git submodule update
+
+ # Create and activate a new Conda environment
+ conda create -n gaze python=3.10
+ conda activate gaze
+
+ # Install LeRobot
+ pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46
+
+ # Install FFmpeg for video logging
+ conda install ffmpeg=7.1.1 -c conda-forge
+
+ # Install AV-ALOHA packages
+ pip install -e ./gym_av_aloha
+ pip install -e ./gaze_av_aloha
+ ```
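+
+ As a quick sanity check that the environment is set up, you can try the short snippet below. This is only a sketch: the import names assume each editable package installs under the same name as its directory (verify against the repository if an import fails).
+
+ ```python
+ # Sanity check: confirm the core packages resolve after installation.
+ import lerobot        # LeRobot, installed from the pinned commit above
+ import gym_av_aloha   # AV-ALOHA simulation environments (assumed import name)
+ import gaze_av_aloha  # gaze / policy training code (assumed import name)
+
+ print("Imports OK:", lerobot.__name__, gym_av_aloha.__name__, gaze_av_aloha.__name__)
+ ```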
+
+ ## Sample Usage
+
+ This model (`iantc104/gaze_model_av_aloha_sim_cube_transfer`) is a pre-trained gaze prediction model. It is designed to be integrated into the `gaze-av-aloha` framework's policy training scripts. For example, to train a robot policy using this pre-trained gaze model (following the Fov-UNet approach) for the `cube_transfer` task, you would use a command similar to the following:
+
+ ```bash
+ python gaze_av_aloha/scripts/train.py \
+     policy=foveated_vit_policy \
+     task=av_aloha_sim_cube_transfer \
+     policy.use_gaze_as_action=false \
+     policy.gaze_model_repo_id=iantc104/gaze_model_av_aloha_sim_cube_transfer \
+     policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
+     policy.optimizer_lr_backbone=1e-5 \
+     wandb.enable=true \
+     wandb.project=<your_project_name> \
+     wandb.entity=<your_wandb_entity> \
+     wandb.job_name=fov-unet \
+     device=cuda
+ ```
+
+ For more detailed usage, alternative policy training methods (like Fov-Act), and other available tasks, please refer to the comprehensive documentation and scripts in the official [gaze-av-aloha GitHub repository](https://github.com/ian-chuang/gaze-av-aloha).
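+
+ The training command above refers to this checkpoint only by its Hub repo id (`policy.gaze_model_repo_id`). If you want to fetch or inspect the checkpoint on its own, the snippet below is a minimal sketch: downloading the snapshot is standard `huggingface_hub` usage, while the commented-out `GazeModel` import is a hypothetical placeholder for whatever class `gaze_av_aloha` actually uses to load gaze models (check the repository for the real import path).
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the checkpoint files (pushed to the Hub via PyTorchModelHubMixin).
+ local_dir = snapshot_download("iantc104/gaze_model_av_aloha_sim_cube_transfer")
+ print(f"Checkpoint downloaded to: {local_dir}")
+
+ # Hypothetical: if the gaze model class subclasses PyTorchModelHubMixin,
+ # it can be restored directly from the Hub, e.g.:
+ # from gaze_av_aloha import GazeModel  # placeholder import path
+ # gaze_model = GazeModel.from_pretrained("iantc104/gaze_model_av_aloha_sim_cube_transfer")
+ ```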
+
+ ## Citation
+
+ If you find this work useful for your research, please consider citing the original paper:
+
+ ```bibtex
+ @misc{chuang2025lookfocusactefficient,
+       title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
+       author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
+       year={2025},
+       eprint={2507.15833},
+       archivePrefix={arXiv},
+       primaryClass={cs.RO},
+       url={https://arxiv.org/abs/2507.15833},
+ }
+ ```