Upload folder using huggingface_hub
Changed files:
- .gitattributes +3 -0
- README.md +156 -3
- assets/.DS_Store +0 -0
- assets/arch.png +3 -0
- assets/embodiments.png +3 -0
- assets/logo.png +3 -0
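The commit title indicates the files were pushed with `huggingface_hub`. A minimal sketch of how such a commit is typically produced follows; the local `folder_path` is an illustrative assumption, while `repo_id` matches the model referenced in the README.

```python
from huggingface_hub import HfApi

api = HfApi()
# Push a local folder (README.md, assets/) to the model repo in one commit.
# folder_path is hypothetical; the commit message mirrors the one above.
api.upload_folder(
    folder_path="./EO-1-3B",
    repo_id="IPEC-COMMUNITY/EO-1-3B",
    repo_type="model",
    commit_message="Upload folder using huggingface_hub",
)
```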
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/arch.png filter=lfs diff=lfs merge=lfs -text
+assets/embodiments.png filter=lfs diff=lfs merge=lfs -text
+assets/logo.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,3 +1,156 @@
<p align="center">
  <img src="assets/logo.png" width="100%">
</p>

<p align="left">
  <a href="http://eo-robotics.ai">
    <img
      src="https://img.shields.io/badge/EO--Robotics-Website-5865F2?logo=googleplay&logoColor=white"
      alt="EO-Robotics Website"
    />
  </a>
  <a href="https://arxiv.org/abs/TODO">
    <img
      src="https://img.shields.io/badge/EO--1-Paper-red?logo=arxiv&logoColor=red"
      alt="EO-Robotics Paper on arXiv"
    />
  </a>
  <a href="https://huggingface.co/collections/IPEC-COMMUNITY/eo-robotics-68ac4ff30e1f746cac28ca14">
    <img
      src="https://img.shields.io/badge/EO--1--3B-Model-FFCC11?logo=huggingface&logoColor=brightyellow"
      alt="EO-1 Model"
    />
  </a>
  <a href="https://huggingface.co/spaces/IPEC-COMMUNITY/EO-Robotics">
    <img
      src="https://img.shields.io/badge/EO--Robotics-Space-orange?logo=huggingface&logoColor=brightyellow"
      alt="EO-Robotics Space"
    />
  </a>
  <a href="https://discord.gg/JqfDs6va">
    <img
      src="https://img.shields.io/badge/EO--Robotics-Discord-155dfc?logo=discord&logoColor=lightblue"
      alt="EO-Robotics Discord"
    />
  </a>
  <a href="mailto:[email protected]">
    <img
      src="https://img.shields.io/badge/EO--Robotics-Email-D14836?logo=gmail&logoColor=red"
      alt="EO-Robotics Email"
    />
  </a>
  <a href="https://huggingface.co/datasets/IPEC-COMMUNITY/EO-Data1.5M">
    <img
      src="https://img.shields.io/badge/Dataset-EO--Data1.5M-brightgreen?logo=huggingface&logoColor=brightyellow"
      alt="EO-1.5M"
    />
  </a>
</p>

## Interleaved Vision-Text-Action Pretraining for General Robot Control

We introduce **EO-1**, an open-source unified embodied foundation model with 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M together with web multimodal data and robot control data (AgiBotWorld, Open X-Embodiment, RoboMIND, SO100-Community, etc.). **EO-1** adopts a single unified decoder-only transformer that integrates discrete auto-regressive decoding with continuous flow-matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model. This work highlights the following features:

- ⚡ **Unified Architecture**: A single decoder-only transformer integrating text, image, video, and actions.
- 📊 **EO-1.5M Dataset**: 1.5M high-quality interleaved samples (Physical, Reasoning, Spatial, Control).
- 🔄 **Interleaved Pretraining**: Seamless synergy between language and action with autoregressive + flow matching.
- 🤖 **Reasoning-Enhanced Generalization**: Superior generalization with multimodal embodied reasoning and real robot control.

<p align="left">
  <img src="assets/embodiments.png" width="100%">
</p>

## 0. Model Architecture

<p align="left">
  <img src="assets/arch.png" width="100%">
</p>

**EO-1** is a vision-language-action (VLA) model built on a single unified decoder-only transformer, equipped with a discrete language-modeling head for multimodal embodied reasoning and a continuous flow-matching head for robot action generation. The language instruction, image observations, robot state, and noisy actions are encoded into a single interleaved sequence of tokens and processed by the shared transformer backbone, whose weights are initialized from Qwen2.5-VL. The model is trained on interleaved vision-text-action data with a combination of the flow-matching and next-token-prediction objectives, and is capable of seamless embodied reasoning and acting.
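
For intuition, the sketch below shows how a joint next-token-prediction plus flow-matching objective over one interleaved sequence could be written. It is a minimal illustration under assumed shapes and names (`backbone`, `lm_head`, `flow_head`, and the batch keys are hypothetical), not the released training code.

```python
import torch.nn.functional as F

def combined_loss(backbone, lm_head, flow_head, batch):
    """Minimal sketch of joint training on one interleaved sequence:
    next-token prediction for text, flow matching for the action chunk.
    Module names, batch keys, and shapes are illustrative assumptions."""
    # Shared decoder-only backbone over the interleaved sequence of text,
    # image, state, and noisy-action embeddings (built upstream).
    hidden = backbone(batch["input_embeds"])                 # (B, L, D)

    # 1) Autoregressive next-token prediction on language positions.
    #    Non-text positions carry the ignore label -100.
    logits = lm_head(hidden[:, :-1])                         # (B, L-1, V)
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # 2) Flow matching on the trailing action-chunk positions. With noisy
    #    actions built as x_t = (1 - t) * noise + t * actions, the
    #    regression target is the constant velocity (actions - noise).
    horizon = batch["actions"].size(1)
    v_pred = flow_head(hidden[:, -horizon:])                 # (B, H, action_dim)
    v_target = batch["actions"] - batch["noise"]             # (B, H, action_dim)
    fm = F.mse_loss(v_pred, v_target)

    return nll + fm
```

At inference time the flow-matching head is integrated from noise to produce an action chunk, while the language head decodes text autoregressively.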

### Input:

Input Type(s):

- Vision: Image Frames, Video
- State: Robot Proprioception
- Language Instruction: Text, Pointing, Bounding Box, etc.

Input Format: Interleaved sequence of vision, language, and state tokens

### Output:

Output Type(s): Actions, Language

Output Format: Continuous-value vectors, Discrete Text

## 1. Inference with the pre-trained model

**EO-1** is built entirely on 🤗 HuggingFace Transformers and LeRobot, making deployment straightforward and accessible. If your environment has `transformers` and `lerobot` installed, you can load the model and run inference directly with just a few lines of code (requires ~6.5 GB of GPU memory). **EO-1** unifies high-level embodied reasoning with low-level robot control, producing either natural-language outputs or actionable robot commands.

```python
import torch
from transformers import AutoModel, AutoProcessor

# load the model and processor
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/EO-1-3B", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/EO-1-3B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# prepare the model input
batch = {
    "observation.images.image": [img],              # PIL.Image, main camera frame
    "observation.images.wrist_image": [wrist_img],  # PIL.Image, wrist camera frame
    "observation.state": [state],                   # robot proprioceptive state
    "task": ["You are a helpful physical agent equipped with both reasoning and robotic control. "
             "You see the Tic-Tac-Toe board, think strategically, act logically, and block threats."],
}

# generate multimodal outputs (text and/or an action chunk)
output = processor.generate(model, batch)
text = output.text
actions = output.action.numpy()
```
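
In closed-loop deployment the model is typically queried repeatedly, executing each predicted action chunk before re-observing. The loop below is an illustrative sketch only: `get_observation` and `send_action` are placeholders for your own robot (or LeRobot) interface, not part of the EO-1 API.

```python
import torch

def rollout(model, processor, task, num_steps=200):
    # Hypothetical closed-loop rollout built on the same processor.generate
    # call shown above; observation and actuation are left to your hardware stack.
    for _ in range(num_steps):
        obs = get_observation()  # placeholder: returns dict with images and state
        batch = {
            "observation.images.image": [obs["image"]],
            "observation.images.wrist_image": [obs["wrist_image"]],
            "observation.state": [obs["state"]],
            "task": [task],
        }
        with torch.no_grad():
            output = processor.generate(model, batch)
        # Execute the predicted action chunk open-loop, then re-observe.
        for action in output.action.numpy():
            send_action(action)  # placeholder: forwards one action to the robot
```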

## Benchmark

**Mastering Diverse Manipulations on Multiple Embodiments**

| Model | Franka Pick-and-Place (7 Tasks) | AgiBot Long-horizon Dexterity (4 Tasks) | WidowX Out-of-Box (13 Tasks) | Reasoning Control (4 Tasks) |
|--------------|---------------------------------|-----------------------------------------|------------------------------|-----------------------------|
| $\pi_0$-fast | 0.610 | 0.449 | 0.227 | – |
| $\pi_0$ | 0.831 | 0.672 | 0.693 | 0.525 |
| GR00T-N1.5 | 0.857 | 0.681 | 0.705 | 0.617 |
| **EO-1** | **0.935** | **0.807** | **0.852** | **0.831** |

**Multi-modal Benchmark Results**

| Model | RoboVQA | ERQA | EO-Bench @ Spatial | EO-Bench @ Temporal | Overall |
|---------------------|----------|----------|--------------------|---------------------|----------|
| Claude 3.5 | 26.7 | 35.5 | 24.0 | 34.8 | 30.3 |
| GPT-4o (2024-11-20) | 47.2 | 40.0 | 35.6 | 39.3 | 40.5 |
| Qwen2.5 VL 3B | 55.9 | 35.3 | 20.0 | 22.6 | 33.5 |
| Magma 8B | 30.3 | 29.3 | 29.4 | 36.7 | 31.4 |
| **EO-1 (3B)** | **58.5** | **45.5** | **36.4** | **38.9** | **44.8** |

**Robot Control Benchmark Results**

| Model | LIBERO | Simpler @ Google VM | Simpler @ Google VA | Simpler @ WidowX VM |
|--------------|-----------|---------------------|---------------------|---------------------|
| $\pi_0$ | 0.942 | 0.714 | 0.714 | 0.692 |
| $\pi_0$-fast | 0.855 | 0.464 | 0.464 | 0.321 |
| GR00T-N1 | 0.939 | – | – | – |
| Magma | – | 0.488 | 0.488 | 0.448 |
| **EO-1** | **0.982** | **0.765** | **0.765** | **0.727** |

## 📝 Citation

If you find this project useful, please consider citing:

```bibtex
@article{eo-robotics,
  title={EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control},
  author={Qu, Delin and Song, Haoming and Chen, Qizhi and Chen, Zhaoqing and Gao, Xianqiang and Ye, Xinyi and Shi, Modi and Ren, Guanghui and Yao, Maoqing and Zhao, Bin and Wang, Dong},
  journal={arXiv preprint},
  year={2025}
}
```
assets/.DS_Store
ADDED
Binary file (6.15 kB)

assets/arch.png
ADDED
Image stored with Git LFS

assets/embodiments.png
ADDED
Image stored with Git LFS

assets/logo.png
ADDED
Image stored with Git LFS