---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
pipeline_tag: robotics
tags:
- VLA
- LIBERO
- Robotics
- Flow
---

# FlowerVLA - Vision-Language-Action Flow Model finetuned on LIBERO Spatial

This is a pretrained FlowerVLA model for robotic manipulation, trained on the LIBERO Spatial dataset.
Flower is an efficient Vision-Language-Action Flow policy for robot learning with only ~1B parameters.

## Model Description

FlowerVLA is a novel architecture that:

- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture (see the sketch below)
- Provides an efficient, versatile VLA policy with only ~1B parameters
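
At inference time, a flow matching policy turns Gaussian noise into an action chunk by integrating a learned velocity field. The sketch below shows the general recipe only, not the exact FlowerVLA implementation: `velocity_fn`, the number of Euler steps, and the conditioning interface are illustrative assumptions.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_fn, context, horizon, action_dim=7, num_steps=4):
    """Integrate a learned velocity field from noise (t=0) to actions (t=1)."""
    x_t = torch.randn(1, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)           # current flow time in [0, 1)
        v = velocity_fn(x_t, t, context)       # predicted velocity at (x_t, t)
        x_t = x_t + dt * v                     # Euler integration step
    return x_t                                 # final sample = action chunk
```
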
## Model Performance

This checkpoint contains the weights for the LIBERO Spatial benchmark and reaches an average success rate of 0.968 across its ten tasks:

| Task | Success rate |
| --- | --- |
| pick_up_the_black_bowl_between_the_plate_and_the_ramekin_and_place_it_on_the_plate | 0.979 |
| pick_up_the_black_bowl_next_to_the_ramekin_and_place_it_on_the_plate | 0.981 |
| pick_up_the_black_bowl_from_table_center_and_place_it_on_the_plate | 0.981 |
| pick_up_the_black_bowl_on_the_cookie_box_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_in_the_top_drawer_of_the_wooden_cabinet_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_on_the_ramekin_and_place_it_on_the_plate | 0.862 |
| pick_up_the_black_bowl_next_to_the_cookie_box_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_on_the_stove_and_place_it_on_the_plate | 1.000 |
| pick_up_the_black_bowl_next_to_the_plate_and_place_it_on_the_plate | 0.917 |
| pick_up_the_black_bowl_on_the_wooden_cabinet_and_place_it_on_the_plate | 0.962 |

### Input/Output Specifications

#### Inputs

- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs

- Action Space: `(B, T, 7)` tensor representing delta end-effector (EEF) actions
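
To make the expected shapes concrete, the snippet below builds dummy observations that match this specification; the image resolution (`H = W = 224`) is an assumption, and the commented-out `model.step` call mirrors the usage example further down.

```python
import torch

B, T, H, W = 1, 1, 224, 224  # batch, time steps, image height/width (assumed)

# Dummy camera observations with the documented (B, T, 3, H, W) layout
static_image = torch.zeros(B, T, 3, H, W)
gripper_image = torch.zeros(B, T, 3, H, W)

obs = {"rgb_obs": {"rgb_static": static_image, "rgb_gripper": gripper_image}}
goal = {"lang_text": "pick up the black bowl and place it on the plate"}

# action = model.step(obs, goal)  # expected output: (B, T, 7) delta EEF actions
```
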
## Usage

Check out our full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.

```python
# Minimal inference call: the two camera views go into `obs`,
# the language instruction into `goal`.
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```

## Training Details

### Configuration

- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05
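
These settings map directly onto a standard PyTorch optimizer. A minimal sketch, assuming `model` is the FlowerVLA policy and that all parameters share a single parameter group (the actual training code may differ):

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,            # Learning Rate
    weight_decay=0.05,  # Weight Decay
)
```
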
## Citation

@inproceedings{reuss2025flower,
  % Add citation when available
}

## License

This model is released under the MIT license.