---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
pipeline_tag: robotics
tags:
- VLA
- LIBERO
- Robotics
- Flow
---
# FlowerVLA - Vision-Language-Action Flow Model finetuned on LIBERO Spatial
This is a pretrained FlowerVLA model for robotic manipulation, fine-tuned on the LIBERO Spatial benchmark.
Flower is an efficient Vision-Language-Action flow policy for robot learning that contains only about 1B parameters.
## Model Description
FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture
- Provides an efficient, versatile VLA policy with only ~1B parameters
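To illustrate the flow matching idea behind the action head, here is a minimal, hypothetical training-objective sketch. This is not FlowerVLA's actual implementation: the `model` signature, the straight-line interpolation path, and the plain MSE loss are all assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, context, actions):
    """Minimal conditional flow-matching objective (illustrative only).

    context: fused vision-language features, e.g. from the Florence-2 encoder
    actions: ground-truth action chunk of shape (B, T, 7)
    """
    noise = torch.randn_like(actions)       # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)  # flow time, sampled uniformly in [0, 1]
    x_t = (1 - t) * noise + t * actions     # point on the straight-line path
    target_velocity = actions - noise       # constant velocity of that path
    pred_velocity = model(x_t, t, context)  # transformer predicts the velocity field
    return F.mse_loss(pred_velocity, target_velocity)
```

At inference time, actions are generated by integrating the learned velocity field from noise toward the data distribution.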
## Model Performance
This checkpoint contains weights for the LIBERO Spatial challenge and achieves an average success rate of 0.968 across tasks:

| Task | Success rate |
|------|--------------|
| `pick_up_the_black_bowl_between_the_plate_and_the_ramekin_and_place_it_on_the_plate` | 0.979 |
| `pick_up_the_black_bowl_next_to_the_ramekin_and_place_it_on_the_plate` | 0.981 |
| `pick_up_the_black_bowl_from_table_center_and_place_it_on_the_plate` | 0.981 |
| `pick_up_the_black_bowl_on_the_cookie_box_and_place_it_on_the_plate` | 1.000 |
| `pick_up_the_black_bowl_in_the_top_drawer_of_the_wooden_cabinet_and_place_it_on_the_plate` | 1.000 |
| `pick_up_the_black_bowl_on_the_ramekin_and_place_it_on_the_plate` | 0.862 |
| `pick_up_the_black_bowl_next_to_the_cookie_box_and_place_it_on_the_plate` | 1.000 |
| `pick_up_the_black_bowl_on_the_stove_and_place_it_on_the_plate` | 1.000 |
| `pick_up_the_black_bowl_next_to_the_plate_and_place_it_on_the_plate` | 0.917 |
| `pick_up_the_black_bowl_on_the_wooden_cabinet_and_place_it_on_the_plate` | 0.962 |
### Input/Output Specifications
#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings
#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions
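For concreteness, here is a minimal sketch of inputs with these shapes. The image resolution `H = W = 224` and the per-dimension action ordering are assumptions, not part of this card; check the repository config for the actual values.

```python
import torch

B, T = 1, 1  # batch size and observation horizon
H = W = 224  # assumed resolution

static_image = torch.rand(B, T, 3, H, W)   # RGB static camera
gripper_image = torch.rand(B, T, 3, H, W)  # RGB gripper camera
# The policy outputs a (B, T, 7) tensor of delta EEF actions,
# e.g. (dx, dy, dz, droll, dpitch, dyaw, gripper) -- ordering assumed.
```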
## Usage
Check out our full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.
```python
# Build the observation and goal dictionaries expected by the policy.
# static_image and gripper_image are (B, T, 3, H, W) RGB tensors (see above).
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image
    }
}
goal = {"lang_text": "pick up the blue cube"}

# Predict a (B, T, 7) chunk of delta EEF actions.
action = model.step(obs, goal)
```
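In closed-loop evaluation, `model.step` would typically be called once per control step. A hypothetical rollout against a LIBERO-style gym environment might look like the sketch below; the `env` API and the `make_obs` helper are assumptions, not part of this card.

```python
max_steps = 300
obs = make_obs(env.reset())  # wrap raw observations into the dict format above
for _ in range(max_steps):
    action = model.step(obs, goal)
    raw_obs, reward, done, info = env.step(action)
    obs = make_obs(raw_obs)
    if done:
        break
```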
## Training Details
### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05
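Mirroring this configuration, the optimizer could be instantiated as follows. This is a sketch only; parameter grouping and any learning-rate schedule used in the actual training code are not covered here.

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,            # learning rate from the configuration above
    weight_decay=0.05,  # weight decay from the configuration above
)
```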
## Citation
```bibtex
@inproceedings{
  reuss2025flower,
  % full citation to be added when available
}
```
## License
This model is released under the MIT license.