nielsr (HF Staff) committed
Commit f816c2a · verified · 1 Parent(s): 39d6c88

Add project page and link to paper

This PR adds the project page and links the model to [From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models](https://huggingface.co/papers/2506.09930).

Files changed (1): README.md (+5 -0)
README.md CHANGED

```diff
@@ -2,8 +2,13 @@
 license: mit
 pipeline_tag: robotics
 ---
+
 # Octo Base
 
+This model is described in [From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models](https://huggingface.co/papers/2506.09930).
+
+Project Page: https://ai4ce.github.io/INT-ACT/
+
 See https://github.com/octo-models/octo for instructions for using this model.
 
 Octo Base is trained with a window size of 2, predicting 7-dimensional actions 4 steps into the future using a diffusion policy. The model is a Transformer with 93M parameters (equivalent to a ViT-B). Images are tokenized by preprocessing with a lightweight convolutional encoder, then grouped into 16x16 patches. Language is tokenized by applying the T5 tokenizer, and then applying the T5-Base language encoder.
```
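The shapes implied by the model card (window size 2, 16x16 image patches, 7-dimensional actions predicted 4 steps ahead) can be sketched as a minimal numpy example. This is illustrative only: the real model runs a convolutional encoder before patching, and the 256x256 input resolution is an assumption, not stated in the card.

```python
import numpy as np

# Hyperparameters from the model card; img_hw is an assumed input resolution.
window_size = 2        # observation history length
action_horizon = 4     # steps predicted into the future
action_dim = 7         # per-step action dimensionality
patch = 16             # patch side length
img_hw = 256           # ASSUMED input resolution (not stated in the card)

# Patchify a dummy image: (H, W, C) -> (num_patches, patch*patch*C).
img = np.zeros((img_hw, img_hw, 3))
patches = img.reshape(img_hw // patch, patch, img_hw // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)       # (256, 768): a 16x16 grid of flattened patch tokens

# The diffusion policy head denoises an action chunk of this shape.
action_chunk = np.zeros((action_horizon, action_dim))
print(action_chunk.shape)  # (4, 7)
```

With a 256x256 input and 16x16 patches, each image contributes 256 tokens; the policy output is a 4x7 chunk of future actions per prediction step.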