---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
pipeline_tag: robotics
tags:
- VLA
- LIBERO
- Robotics
- Flow
---

# FlowerVLA - Vision-Language-Action Flow Model finetuned on LIBERO Spatial

This is a pretrained FlowerVLA model for robotic manipulation, trained on the LIBERO Spatial dataset.
Flower is an efficient Vision-Language-Action Flow policy for robot learning with only ~1B parameters.

## Model Description

FlowerVLA is a novel architecture that:

- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture (see the sketch below)
- Provides an efficient, versatile VLA policy with only ~1B parameters
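
At inference time, a flow matching policy turns Gaussian noise into an action chunk by integrating a learned velocity field. The sketch below shows the general recipe only, not the exact FlowerVLA implementation: `velocity_fn`, the number of Euler steps, and the conditioning interface are illustrative assumptions.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_fn, context, horizon, action_dim=7, num_steps=4):
    """Integrate a learned velocity field from noise (t=0) to actions (t=1)."""
    x_t = torch.randn(1, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)           # current flow time in [0, 1)
        v = velocity_fn(x_t, t, context)       # predicted velocity at (x_t, t)
        x_t = x_t + dt * v                     # Euler integration step
    return x_t                                 # final sample = action chunk
```
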
## Model Performance

This checkpoint contains the weights for the LIBERO Spatial benchmark and reaches an average success rate of 0.968 across its ten tasks:

| Task | Success rate |
| --- | --- |
| pick_up_the_black_bowl_between_the_plate_and_the_ramekin_and_place_it_on_the_plate | 0.979 |
| pick_up_the_black_bowl_next_to_the_ramekin_and_place_it_on_the_plate | 0.981 |
| pick_up_the_black_bowl_from_table_center_and_place_it_on_the_plate | 0.981 |
| pick_up_the_black_bowl_on_the_cookie_box_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_in_the_top_drawer_of_the_wooden_cabinet_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_on_the_ramekin_and_place_it_on_the_plate | 0.862 |
| pick_up_the_black_bowl_next_to_the_cookie_box_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_on_the_stove_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_next_to_the_plate_and_place_it_on_the_plate | 0.917 |
| pick_up_the_black_bowl_on_the_wooden_cabinet_and_place_it_on_the_plate | 0.962 |

### Input/Output Specifications

#### Inputs

- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs

- Action Space: `(B, T, 7)` tensor representing delta end-effector (EEF) actions
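
To make the expected shapes concrete, the snippet below builds dummy observations that match this specification; the image resolution (`H = W = 224`) is an assumption, and the commented-out `model.step` call mirrors the usage example further down.

```python
import torch

B, T, H, W = 1, 1, 224, 224  # batch, time steps, image height/width (assumed)

# Dummy camera observations with the documented (B, T, 3, H, W) layout
static_image = torch.zeros(B, T, 3, H, W)
gripper_image = torch.zeros(B, T, 3, H, W)

obs = {"rgb_obs": {"rgb_static": static_image, "rgb_gripper": gripper_image}}
goal = {"lang_text": "pick up the black bowl and place it on the plate"}

# action = model.step(obs, goal)  # expected output: (B, T, 7) delta EEF actions
```
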
## Usage

Check out our full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.

```python
# Minimal inference call: the two camera views go into `obs`,
# the language instruction into `goal`.
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```

## Training Details

### Configuration

- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05
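
These settings map directly onto a standard PyTorch optimizer. A minimal sketch, assuming `model` is the FlowerVLA policy and that all parameters share a single parameter group (the actual training code may differ):

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,            # Learning Rate
    weight_decay=0.05,  # Weight Decay
)
```
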
## Citation

@inproceedings{reuss2025flower,
  % Add citation when available
}

## License

This model is released under the MIT license.