delinqu committed on
Commit dcaab56 · verified · 1 Parent(s): 8eab69d

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/arch.png filter=lfs diff=lfs merge=lfs -text
+assets/embodiments.png filter=lfs diff=lfs merge=lfs -text
+assets/logo.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,156 @@
----
-license: mit
----
<p align="center">
  <img src="assets/logo.png" width="100%">
</p>

<p align="left">
  <a href="http://eo-robotics.ai">
    <img src="https://img.shields.io/badge/EO--Robotics-Website-5865F2?logo=googleplay&logoColor=white" alt="EO-Robotics Website"/>
  </a>
  <a href="https://arxiv.org/abs/TODO">
    <img src="https://img.shields.io/badge/EO--1-Paper-red?logo=arxiv&logoColor=red" alt="EO-Robotics Paper on arXiv"/>
  </a>
  <a href="https://huggingface.co/collections/IPEC-COMMUNITY/eo-robotics-68ac4ff30e1f746cac28ca14">
    <img src="https://img.shields.io/badge/EO--1--3B-Model-FFCC11?logo=huggingface&logoColor=brightyellow" alt="EO-1 Model"/>
  </a>
  <a href="https://huggingface.co/spaces/IPEC-COMMUNITY/EO-Robotics">
    <img src="https://img.shields.io/badge/EO--Robotics-Space-orange?logo=huggingface&logoColor=brightyellow" alt="EO-Robotics Space"/>
  </a>
  <a href="https://discord.gg/JqfDs6va">
    <img src="https://img.shields.io/badge/EO--Robotics-Discord-155dfc?logo=discord&logoColor=lightblue" alt="EO-Robotics Discord"/>
  </a>
  <a href="mailto:[email protected]">
    <img src="https://img.shields.io/badge/EO--Robotics-Email-D14836?logo=gmail&logoColor=red" alt="EO-Robotics Email"/>
  </a>
  <a href="https://huggingface.co/datasets/IPEC-COMMUNITY/EO-Data1.5M">
    <img src="https://img.shields.io/badge/Dataset-EO--Data1.5M-brightgreen?logo=huggingface&logoColor=brightyellow" alt="EO-Data1.5M"/>
  </a>
</p>

## Interleaved Vision-Text-Action Pretraining for General Robot Control

We introduce the **EO-1** model, an open-source unified embodied foundation model with 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M together with web multimodal data and robot control data (AgiBotWorld, Open X-Embodiment, RoboMIND, SO100-Community, etc.). **EO-1** adopts a single unified decoder-only transformer that integrates discrete auto-regressive decoding with continuous flow-matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model. This work highlights the following features:

- ⚡ **Unified Architecture**: A single decoder-only transformer integrating text, image, video, and actions.
- 📚 **EO-1.5M Dataset**: 1.5M high-quality interleaved samples (Physical, Reasoning, Spatial, Control).
- 🌀 **Interleaved Pretraining**: Seamless synergy between language and action with autoregressive decoding plus flow-matching denoising.
- 🤖 **Reasoning-Enhanced Generalization**: Superior generalization via multimodal embodied reasoning and real-robot control.

<p align="left">
  <img src="assets/embodiments.png" width="100%">
</p>

## 0. Model Architecture

<p align="left">
  <img src="assets/arch.png" width="100%">
</p>

**EO-1** is a Vision-Language-Action (VLA) model that adopts a single unified decoder-only transformer, equipped with a discrete language-modeling head for multimodal embodied reasoning and a continuous flow-matching head for robot action generation. The language instruction, image observations, robot state, and noisy actions are encoded into a single interleaved token sequence processed by the shared transformer backbone, whose weights are initialized from Qwen2.5-VL. The model is trained on interleaved vision-text-action data with a combination of the flow-matching and next-token-prediction objectives, and is capable of seamless embodied reasoning and acting.

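To make the combined training recipe concrete, here is a minimal PyTorch sketch of how a shared backbone can be optimized with a next-token-prediction loss on the text span and a flow-matching loss on a noisy action chunk. The module names, dimensions, and the encoder stand-in for the decoder-only backbone are illustrative assumptions, not the actual EO-1 implementation.

```python
# Illustrative sketch only: toy dims, random data, encoder as a stand-in
# for the shared decoder-only backbone. Not the EO-1 training code.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, VOCAB, ACT_DIM, HORIZON, TEXT_LEN = 256, 1000, 7, 16, 32

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)
text_embed = nn.Embedding(VOCAB, D)
action_proj = nn.Linear(ACT_DIM, D)
lm_head = nn.Linear(D, VOCAB)         # discrete language-modeling head
action_head = nn.Linear(D, ACT_DIM)   # continuous flow-matching head

# Toy interleaved sequence: [text tokens | noisy action tokens]
text_ids = torch.randint(0, VOCAB, (1, TEXT_LEN))
actions = torch.randn(1, HORIZON, ACT_DIM)      # ground-truth action chunk
noise = torch.randn_like(actions)               # x_0 ~ N(0, I)
t = torch.rand(1, 1, 1)                         # flow-matching time in [0, 1]
noisy_actions = (1 - t) * noise + t * actions   # linear interpolation path

tokens = torch.cat([text_embed(text_ids), action_proj(noisy_actions)], dim=1)
hidden = backbone(tokens)
text_h, act_h = hidden[:, :TEXT_LEN], hidden[:, TEXT_LEN:]

# Next-token prediction: predict token i+1 from position i on the text span
ntp_loss = F.cross_entropy(
    lm_head(text_h[:, :-1]).reshape(-1, VOCAB), text_ids[:, 1:].reshape(-1)
)
# Flow matching: regress the velocity field (actions - noise) of the path
fm_loss = F.mse_loss(action_head(act_h), actions - noise)
loss = ntp_loss + fm_loss
```

At inference time, the flow-matching head is used to iteratively denoise an action chunk from Gaussian noise, while the language-modeling head decodes text autoregressively from the same backbone.
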
### Input

Input Type(s):

- Vision: Image Frames, Video
- State: Robot Proprioception
- Language Instruction: Text, Pointing, Bounding Box, etc.

### Output

Output Type(s): Actions, Language

Output Format: Continuous-value vectors, Discrete Text

## 1. Inference with the Pre-trained Model

**EO-1** is built entirely on 🤗 HuggingFace Transformers and LeRobot, making deployment straightforward and accessible. If your environment supports transformers and lerobot, you can load the model and run inference directly with just a few lines of code (requires ~6.5 GB of GPU memory). **EO-1** unifies high-level embodied reasoning with low-level robot control, producing either natural-language outputs or actionable robot commands.

```python
import torch
from transformers import AutoModel, AutoProcessor

# load the model and processor
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/EO-1-3B", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/EO-1-3B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# prepare the model input (img and wrist_img are PIL.Image frames,
# state is the robot proprioception vector)
batch = {
    "observation.images.image": [img],
    "observation.images.wrist_image": [wrist_img],
    "observation.state": [state],
    "task": ["You are a helpful physical agent equipped with both reasoning and robotic control. "
             "You see the Tic-Tac-Toe board, think strategically, act logically, and block threats."],
}

# generate multimodal outputs: text and/or a chunk of continuous actions
output = processor.generate(model, batch)
text = output.text
actions = output.action.numpy()
```
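
For a quick smoke test without a robot, the observation entries in the snippet above can be filled with placeholder data. The image size, state dimensionality, and task string below are assumptions; they must match the embodiment configuration you actually deploy.

```python
# Hypothetical smoke test, reusing `processor` and `model` from above.
# Shapes are assumptions: match your camera resolution and state config.
import numpy as np
from PIL import Image

img = Image.new("RGB", (224, 224))        # placeholder main-camera frame
wrist_img = Image.new("RGB", (224, 224))  # placeholder wrist-camera frame
state = np.zeros(7, dtype=np.float32)     # placeholder proprioception vector

batch = {
    "observation.images.image": [img],
    "observation.images.wrist_image": [wrist_img],
    "observation.state": [state],
    "task": ["Pick up the red block and place it in the bowl."],
}
output = processor.generate(model, batch)
print(output.text)          # language output, if any
print(output.action.shape)  # action chunk to feed the low-level controller
```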

## Benchmark

**Mastering Diverse Manipulations on Multiple Embodiments** (average success rate)

| Model | Franka Pick-and-Place (7 Tasks) | AgiBot Long-horizon Dexterity (4 Tasks) | WidowX Out-of-Box (13 Tasks) | Reasoning Control (4 Tasks) |
|--------------|---------------------------------|-----------------------------------------|------------------------------|-----------------------------|
| $\pi_0$-fast | 0.610 | 0.449 | 0.227 | — |
| $\pi_0$ | 0.831 | 0.672 | 0.693 | 0.525 |
| GR00T-N1.5 | 0.857 | 0.681 | 0.705 | 0.617 |
| **EO-1** | **0.935** | **0.807** | **0.852** | **0.831** |

**Multimodal Benchmark Results**

| Model | RoboVQA | ERQA | EO-Bench @ Spatial | EO-Bench @ Temporal | Overall |
|---------------------|----------|----------|--------------------|---------------------|----------|
| Claude 3.5 | 26.7 | 35.5 | 24.0 | 34.8 | 30.3 |
| GPT-4o (2024-11-20) | 47.2 | 40.0 | 35.6 | 39.3 | 40.5 |
| Qwen2.5 VL 3B | 55.9 | 35.3 | 20.0 | 22.6 | 33.5 |
| Magma 8B | 30.3 | 29.3 | 29.4 | 36.7 | 31.4 |
| **EO-1 (3B)** | **58.5** | **45.5** | **36.4** | **38.9** | **44.8** |

**Robot Control Benchmark Results**

| Model | LIBERO | Simpler @ Google VM | Simpler @ Google VA | Simpler @ WidowX VM |
|--------------|-----------|---------------------|---------------------|---------------------|
| $\pi_0$ | 0.942 | 0.714 | 0.714 | 0.692 |
| $\pi_0$-fast | 0.855 | 0.464 | 0.464 | 0.321 |
| GR00T-N1 | 0.939 | — | — | — |
| Magma | — | 0.488 | 0.488 | 0.448 |
| **EO-1** | **0.982** | **0.765** | **0.765** | **0.727** |

## 📚 Citation

If you find this project useful, please consider citing:

```bibtex
@article{eo-robotics,
  title={EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control},
  author={Qu, Delin and Song, Haoming and Chen, Qizhi and Chen, Zhaoqing and Gao, Xianqiang and Ye, Xinyi and Shi, Modi and Ren, Guanghui and Yao, Maoqing and Zhao, Bin and Wang, Dong},
  journal={arXiv preprint},
  year={2025}
}
```
assets/.DS_Store ADDED
Binary file (6.15 kB)
 
assets/arch.png ADDED

Git LFS Details

  • SHA256: 6fd8c21cebb06613388df41b60e83181c9c80f1258f37b68ecb1a4857148263b
  • Pointer size: 131 Bytes
  • Size of remote file: 490 kB
assets/embodiments.png ADDED

Git LFS Details

  • SHA256: fa9cbfd977a72a72483dc5c408e7c7ef2e7828f9a1696a027833d6c8bda8523f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.79 MB
assets/logo.png ADDED

Git LFS Details

  • SHA256: 7e80b822b68840dbc7cd1ceda47e2b9da5e9469b7a426ce20d36fc9d6c261c9b
  • Pointer size: 131 Bytes
  • Size of remote file: 345 kB