---
license: mit
tags:
- vla
- robotics
- multimodal
- autoregressive
library_name: transformers
pipeline_tag: robotics
---

# Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

<p align="center">
    <img src="https://raw.githubusercontent.com/BeingBeyond/Being-H0/refs/heads/main/docs/assets/image/being-h0-black.png" width="300"/>
</p>

<div align="center">

[![Project Page](https://img.shields.io/badge/Website-Being--H0-green)](https://beingbeyond.github.io/Being-H0)
[![arXiv](https://img.shields.io/badge/arXiv-2507.15597-b31b1b.svg)](https://arxiv.org/abs/2507.15597)
[![Model](https://img.shields.io/badge/Model-Being--H0-white)](https://huggingface.co/BeingBeyond/Being-H0)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)

</div>

<p align="center">
    <img src="https://raw.githubusercontent.com/BeingBeyond/Being-H0/refs/heads/main/docs/assets/image/overview.png"/>
</p>


We introduce **Being-H0**, the first dexterous Vision-Language-Action model pretrained from large-scale human videos via explicit hand motion modeling.

## News

- **[2025-08-02]**: We release the **Being-H0** codebase and pretrained models! Check our [Hugging Face Model Hub](https://huggingface.co/BeingBeyond/Being-H0) for more details. πŸ”₯πŸ”₯πŸ”₯
- **[2025-07-21]**: We publish **Being-H0**! Check our paper [here](https://arxiv.org/abs/2507.15597). 🌟🌟🌟

## Model Checkpoints

Download pre-trained models from Hugging Face:

| Model Type | Model Name | Parameters | Description |
|------------|------------|------------|-------------|
| **Motion Model** | [Being-H0-GRVQ-8K](https://huggingface.co/BeingBeyond/Being-H0-GRVQ-8K) | - | Motion tokenizer |
| **VLA Pre-trained** | [Being-H0-1B-2508](https://huggingface.co/BeingBeyond/Being-H0-1B-2508) | 1B | Base vision-language-action model |
| **VLA Pre-trained** | [Being-H0-8B-2508](https://huggingface.co/BeingBeyond/Being-H0-8B-2508) | 8B | Base vision-language-action model |
| **VLA Pre-trained** | [Being-H0-14B-2508](https://huggingface.co/BeingBeyond/Being-H0-14B-2508) | 14B | Base vision-language-action model |
| **VLA Post-trained** | [Being-H0-8B-Align-2508](https://huggingface.co/BeingBeyond/Being-H0-8B-Align-2508) | 8B | Fine-tuned for robot alignment |
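
If you prefer to fetch checkpoints ahead of time, the sketch below uses the `huggingface-cli` tool from the `huggingface_hub` package; the local directory layout is only an illustrative assumption, so adjust the paths to your setup.

```bash
# Sketch: download the motion tokenizer and one pretrained VLA checkpoint.
# The --local-dir paths are placeholders; pick any location you like.
pip install -U "huggingface_hub[cli]"
huggingface-cli download BeingBeyond/Being-H0-GRVQ-8K --local-dir ./checkpoints/Being-H0-GRVQ-8K
huggingface-cli download BeingBeyond/Being-H0-1B-2508 --local-dir ./checkpoints/Being-H0-1B-2508
```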

## Dataset

We provide a dataset for post-training the VLA model, available on Hugging Face:

| Dataset Type | Dataset Name | Description |
|--------------|--------------|-------------|
| **VLA Post-training** | [h0_post_train_db_2508](https://huggingface.co/datasets/BeingBeyond/h0_post_train_db_2508) | Post-training dataset for pretrained Being-H0 VLA model |
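
As with the model checkpoints, the dataset can be fetched with `huggingface-cli` (a sketch; the target directory is a placeholder):

```bash
# Sketch: download the post-training dataset (note the dataset repo type).
huggingface-cli download BeingBeyond/h0_post_train_db_2508 \
    --repo-type dataset \
    --local-dir ./data/h0_post_train_db_2508
```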

## Setup

### Clone repository

```bash
git clone https://github.com/BeingBeyond/Being-H0.git
cd Being-H0
```

### Create environment
```bash
conda env create -f environment.yml
conda activate beingvla
```

### Install dependencies
```bash
pip install flash-attn --no-build-isolation
pip install git+https://github.com/lixiny/manotorch.git
pip install git+https://github.com/mattloper/chumpy.git
```

### Download MANO package

- Visit the [MANO website](http://mano.is.tue.mpg.de/)
- Create an account by clicking _Sign Up_ and providing your information
- Download the Models and Code (the downloaded file should have the format `mano_v*_*.zip`). Note that all code and data from this download fall under the [MANO license](http://mano.is.tue.mpg.de/license).
- Unzip the archive and copy the contents of the `mano_v*_*/` folder into the `beingvla/models/motion/mano/` folder, as sketched below
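
A quick sanity check after copying (a sketch; the exact file layout depends on the MANO release you downloaded, but the pickled hand models such as `MANO_LEFT.pkl` and `MANO_RIGHT.pkl` should end up somewhere under this folder):

```bash
# Sketch: list the MANO folder recursively and confirm the hand model files are present.
ls -R beingvla/models/motion/mano/
# Expected to include, depending on the release: MANO_LEFT.pkl, MANO_RIGHT.pkl, ...
```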

## Inference

### Motion Generation

- To generate hand motion tokens and render the resulting motion, use the motion tokenizer (`Being-H0-GRVQ-8K`) together with a pretrained VLA model (`Being-H0-{1B,8B,14B}-2508`).
- Run inference with the command below. For `--motion_code_path`, use a `+` symbol to jointly specify the wrist and finger motion code paths, e.g., `--motion_code_path "/path/to/Being-H0-GRVQ-8K/wrist/+/path/to/Being-H0-GRVQ-8K/finger/"`.
- Set `--hand_mode` to `left`, `right`, or `both` to specify which hand(s) to use for the task.

```bash
python -m beingvla.inference.vla_internvl_inference \
    --model_path /path/to/Being-H0-XXX \
    --motion_code_path "/path/to/Being-H0-GRVQ-8K/wrist/+/path/to/Being-H0-GRVQ-8K/finger/" \
    --input_image ./playground/unplug_airpods.jpg \
    --task_description "unplug the charging cable from the AirPods" \
    --hand_mode both \
    --num_samples 3 \
    --num_seconds 4 \
    --enable_render true \
    --gpu_device 0 \
    --output_dir ./work_dirs/
```

- **To run inference on your own photos**: See the [Camera Intrinsics Guide](https://github.com/BeingBeyond/Being-H0/blob/main/docs/camera_intrinsics.md) for how to estimate camera intrinsics and supply them for custom inference.

### Evaluation

- You can post-train our pretrained VLA model on real robot data. Once you have a post-trained model (e.g., `Being-H0-8B-Align-2508`), use the following commands to communicate with a real robot or evaluate the model on a robot task.

- Set up robot communication:

```bash
python -m beingvla.models.motion.m2m.aligner.run_server \
    --model-path /path/to/Being-H0-XXX-Align \
    --port 12305 \
    --action-chunk-length 16
```
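
Before launching the evaluation, it can help to confirm the policy server is actually listening (a minimal sketch; it only probes TCP reachability on port 12305 and says nothing about the message protocol, which `eval_policy` handles):

```bash
# Sketch: check that the policy server started above is accepting connections on port 12305.
nc -z localhost 12305 && echo "policy server is up" || echo "policy server not reachable"
```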
- Run evaluation on a robot task:

```bash
python -m beingvla.models.motion.m2m.aligner.eval_policy \
    --model-path /path/to/Being-H0-XXX-Align \
    --zarr-path /path/to/real-robot/data \
    --task_description "Put the little white duck into the cup." \
    --action-chunk-length 16
```

## Contributing and Building on Being-H0

We encourage researchers and practitioners to leverage Being-H0 as a foundation for their own experiments and applications. Whether you're adapting Being-H0 to new robotic platforms, exploring novel hand manipulation tasks, or extending the model to new domains, our modular codebase is designed to support your innovations. We welcome contributions of all kinds, from bug fixes and documentation improvements to new features and model architectures. By building on Being-H0 together, we can advance the field of dexterous vision-language-action modeling and enable robots to understand and replicate the rich complexity of human hand movements. Join us in making robotic manipulation more intuitive, capable, and accessible to all.

## Citation
If you find our work useful, please consider citing us and giving our repository a star! 🌟🌟🌟

**Being-H0**

```bibtex
@article{beingbeyond2025beingh0,
  title={Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos},
  author={Luo, Hao and Feng, Yicheng and Zhang, Wanpeng and Zheng, Sipeng and Wang, Ye and Yuan, Haoqi and Liu, Jiazheng and Xu, Chaoyi and Jin, Qin and Lu, Zongqing},
  journal={arXiv preprint arXiv:2507.15597},
  year={2025}
}
```