FlowerVLA - Vision-Language-Action Flow Model finetuned on LIBERO Spatial
This is a pretrained FlowerVLA model for robotic manipulation, trained on the LIBERO Spatial dataset. Flower is an efficient Vision-Language-Action flow policy for robot learning with only about 1B parameters.
Model Description
FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture
- Provides an efficient, versatile VLA policy with only ~1B parameters
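To give a rough intuition for how a flow matching policy produces actions, the sketch below integrates a velocity field from Gaussian noise to an action vector with a few Euler steps. This is a generic illustration, not FlowerVLA's actual sampler: `toy_velocity`, `sample_actions`, the step count, and the target vector are all hypothetical. In a trained model the velocity would come from the transformer, conditioned on images and language, rather than from a closed-form expression.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Closed-form velocity that moves x to `target` along a straight line by t = 1.

    Hypothetical stand-in for a learned, observation-conditioned network.
    """
    return (target - x) / (1.0 - t)

def sample_actions(target, num_steps=4, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with explicit Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        x = x + dt * toy_velocity(x, t, target)
    return x

# Illustrative 7-dim action; the values are made up.
target_action = np.array([0.1, -0.2, 0.0, 0.0, 0.0, 0.3, 1.0])
sampled = sample_actions(target_action)
print(np.round(sampled, 3))
```

With this toy field the final Euler step lands on the target, which is why a handful of integration steps is enough here.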
Model Performance
This checkpoint contains weights for the LIBERO Spatial challenge and achieves these results:
avg_seq_len success rate: 0.9681089520454407
Per-task success rates:
- pick_up_the_black_bowl_between_the_plate_and_the_ramekin_and_place_it_on_the_plate: 0.9791666666666666
- pick_up_the_black_bowl_next_to_the_ramekin_and_place_it_on_the_plate: 0.9807692307692308
- pick_up_the_black_bowl_from_table_center_and_place_it_on_the_plate: 0.9807692307692308
- pick_up_the_black_bowl_on_the_cookie_box_and_place_it_on_the_plate: 1.0
- pick_up_the_black_bowl_in_the_top_drawer_of_the_wooden_cabinet_and_place_it_on_the_plate: 1.0
- pick_up_the_black_bowl_on_the_ramekin_and_place_it_on_the_plate: 0.8621794871794872
- pick_up_the_black_bowl_next_to_the_cookie_box_and_place_it_on_the_plate: 1.0
- pick_up_the_black_bowl_on_the_stove_and_place_it_on_the_plate: 1.0
- pick_up_the_black_bowl_next_to_the_plate_and_place_it_on_the_plate: 0.9166666666666666
- pick_up_the_black_bowl_on_the_wooden_cabinet_and_place_it_on_the_plate: 0.9615384615384616
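The reported average appears to be the unweighted mean of the ten per-task success rates. A quick check under that assumption (the variable names are illustrative):

```python
# Per-task success rates copied from the results above, in the same order.
task_success = [
    0.9791666666666666,
    0.9807692307692308,
    0.9807692307692308,
    1.0,
    1.0,
    0.8621794871794872,
    1.0,
    1.0,
    0.9166666666666666,
    0.9615384615384616,
]

# Unweighted mean over tasks.
mean_success = sum(task_success) / len(task_success)
reported = 0.9681089520454407

print(f"mean of per-task rates: {mean_success:.6f}")
print(f"reported average:       {reported:.6f}")
```

The two values agree to about seven decimal places; the small remaining gap would be consistent with the average having been accumulated in single precision, though the card does not say so.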
Input/Output Specifications
Inputs
- RGB Static Camera: (B, T, 3, H, W) tensor
- RGB Gripper Camera: (B, T, 3, H, W) tensor
- Language Instructions: Text strings
Outputs
- Action Space: (B, T, 7) tensor representing delta EEF actions
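The shapes above can be made concrete with placeholder arrays. This is a minimal sketch, assuming NumPy arrays as stand-ins for framework tensors; the sizes chosen for B, T, H, and W and the two helper functions are hypothetical, since the card does not fix them.

```python
import numpy as np

# Illustrative sizes only; the model card does not specify B, T, H, or W.
B, T, H, W = 2, 1, 224, 224

# Camera inputs in the documented (B, T, 3, H, W) layout.
rgb_static = np.zeros((B, T, 3, H, W), dtype=np.float32)
rgb_gripper = np.zeros((B, T, 3, H, W), dtype=np.float32)

# One text instruction per batch element.
instructions = ["pick up the black bowl on the stove and place it on the plate"] * B

def check_image_batch(x):
    """Return True if x has five dimensions with 3 colour channels at axis 2."""
    return x.ndim == 5 and x.shape[2] == 3

def check_action_batch(a):
    """Return True if a has three dimensions with 7 action values at the last axis."""
    return a.ndim == 3 and a.shape[-1] == 7

# Placeholder for a policy output in the documented (B, T, 7) layout.
actions = np.zeros((B, T, 7), dtype=np.float32)

print(check_image_batch(rgb_static), check_image_batch(rgb_gripper), check_action_batch(actions))
```

Checks like these are cheap to run before calling the model and catch channel-order or missing-time-axis mistakes early.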
Usage
Check out our full model implementation on GitHub (link: todo) and follow the instructions in the README to test the model on one of the environments. A minimal inference call looks like this:
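# static_image and gripper_image are the camera tensors described under Inputs
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
# The language instruction is passed as a goal dictionary
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)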
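In closed-loop use, `model.step` would typically be called once per environment step. The sketch below shows that pattern with a stand-in policy so it runs without the real checkpoint. `DummyPolicy`, `rollout`, and `fake_obs` are hypothetical, and the assumption that `step` returns a single 7-dim action per call may not match the actual implementation.

```python
import numpy as np

class DummyPolicy:
    """Hypothetical stand-in exposing the same step(obs, goal) call as above."""

    def step(self, obs, goal):
        # A real policy would run the network here.
        # Assumption: one 7-dim delta EEF action is returned per call.
        return np.zeros(7, dtype=np.float32)

def rollout(model, get_obs, instruction, max_steps=5):
    """Query the policy once per step and stack the returned actions."""
    goal = {"lang_text": instruction}
    actions = []
    for _ in range(max_steps):
        obs = get_obs()
        actions.append(model.step(obs, goal))
    return np.stack(actions)

def fake_obs():
    """Hypothetical observation source; the image size is illustrative."""
    frame = np.zeros((1, 1, 3, 224, 224), dtype=np.float32)
    return {"rgb_obs": {"rgb_static": frame, "rgb_gripper": frame}}

trajectory = rollout(DummyPolicy(), fake_obs, "pick up the blue cube")
print(trajectory.shape)
```

In a real evaluation, `get_obs` would read from the simulator and each returned action would be sent to the environment before the next observation is taken.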
Training Details
Configuration
- Optimizer: AdamW
- Learning Rate: 2e-5
- Weight Decay: 0.05
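In PyTorch, the settings above would map onto `torch.optim.AdamW` roughly as follows. This is a sketch only: the `torch.nn.Linear` module is a hypothetical stand-in for the real model, and the betas and epsilon are left at PyTorch's defaults because the card does not list them.

```python
import torch

# Hypothetical stand-in module; the real FlowerVLA model would be used instead.
model = torch.nn.Linear(8, 7)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,            # learning rate from the configuration above
    weight_decay=0.05,  # weight decay from the configuration above
)

print(optimizer.defaults["lr"], optimizer.defaults["weight_decay"])
```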
Citation
@inproceedings{reuss2025flower,
  # Add citation when available
}
License
This model is released under the MIT license.
Model tree for mbreuss/flower_libero_spatial
Base model
microsoft/Florence-2-large