---
datasets:
- IPEC-COMMUNITY/bridge_orig_lerobot
base_model:
- nvidia/GR00T-N1.5-3B
---

This is a GR00T model fine-tuned on the Bridge dataset (30k steps, 8 A100 GPUs) using the default fine-tuning settings (i.e., the VLM backbone frozen).

The evaluation was conducted with the SimplerEnv-OpenVLA repository (https://github.com/DelinQu/SimplerEnv-OpenVLA), with thanks to the authors for their contribution to the community.

This fine-tuned model should not be considered representative of GR00T's actual performance.

| ckpt_name | GR00T-N1.5 | RT-1(Converged) | RT-1(15%) | RT-1-X | RT-2-X | Octo-Base | Octo-Small | RT-1(begin) | OpenVLA | RoboVLM |
|---|---|---|---|---|---|---|---|---|---|---|
| put_spoon_on_tablecloth/matching_partial | 0.833 | nan | nan | 0.167 | nan | 0.347 | 0.778 | nan | 0.041 | 0.375 |
| put_spoon_on_tablecloth/matching_entire | 0.625 | nan | nan | 0.0 | nan | 0.125 | 0.472 | nan | 0.0 | 0.208 |
| put_carrot_on_plate/matching_partial | 0.542 | nan | nan | 0.208 | nan | 0.528 | 0.278 | nan | 0.333 | 0.333 |
| put_carrot_on_plate/matching_entire | 0.458 | nan | nan | 0.042 | nan | 0.083 | 0.097 | nan | 0.0 | 0.25 |
| stack_green_block_on_yellow_block/matching_partial | 0.708 | nan | nan | 0.083 | nan | 0.319 | 0.403 | nan | 0.125 | 0.083 |
| stack_green_block_on_yellow_block/matching_entire | 0.167 | nan | nan | 0.0 | nan | 0.0 | 0.042 | nan | 0.0 | 0.083 |
| put_eggplant_in_basket/matching_partial | 0.417 | nan | nan | 0.0 | nan | 0.667 | 0.875 | nan | 0.083 | 0.0 |
| put_eggplant_in_basket/matching_entire | 0.208 | nan | nan | 0.0 | nan | 0.431 | 0.569 | nan | 0.041 | 0.0 |

Data configuration:

In addition to adding the following code to data_config.py, I also provide the modality.json required by the GR00T dataloader (a rough sketch of its layout follows the config classes below).


class FractalDataConfig(So100DataConfig):
    video_keys = ["video.image", ]
    state_keys = ["state.x", "state.y", "state.z", "state.rx", "state.ry", "state.rz", "state.rw",  "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]

    def transform(self) -> ModalityTransform:
        transforms = [
            # video transforms
            VideoToTensor(apply_to=self.video_keys),
            VideoCrop(apply_to=self.video_keys, scale=0.95),
            VideoResize(apply_to=self.video_keys, height=224, width=224, interpolation="linear"),
            VideoColorJitter(
                apply_to=self.video_keys,
                brightness=0.3,
                contrast=0.4,
                saturation=0.5,
                hue=0.08,
            ),
            VideoToNumpy(apply_to=self.video_keys),
            # state transforms
            StateActionToTensor(apply_to=self.state_keys),
            StateActionTransform(
                apply_to=self.state_keys,
                normalization_modes={key: "min_max" for key in self.state_keys},
            ),
            # action transforms
            StateActionToTensor(apply_to=self.action_keys),
            StateActionTransform(
                apply_to=self.action_keys,
                normalization_modes={key: "min_max" for key in self.action_keys},
            ),
            # concat transforms
            ConcatTransform(
                video_concat_order=self.video_keys,
                state_concat_order=self.state_keys,
                action_concat_order=self.action_keys,
            ),
            # model-specific transform
            GR00TTransform(
                state_horizon=len(self.observation_indices),
                action_horizon=len(self.action_indices),
                max_state_dim=64,
                max_action_dim=32,
            ),
        ]
        return ComposedModalityTransform(transforms=transforms)


class BridgeDataConfig(FractalDataConfig):
    video_keys = ["video.image_0", ]
    state_keys = ["state.x", "state.y", "state.z", "state.roll", "state.pitch", "state.yaw", "state.pad",  "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]
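
For reference, the GR00T LeRobot schema expects modality.json to map each state/action key to a slice of the flattened state/action vector and each video key back to its original dataset key. The sketch below only illustrates that layout for the Bridge keys above; the index ranges and the `original_key` value are assumptions, so treat the modality.json shipped with this repository as the source of truth.

```python
# Illustrative sketch only -- use the modality.json provided with this model.
# Each state/action entry names a [start, end) slice of the flattened vector;
# each video entry points back to the original dataset key.
bridge_modality_sketch = {
    "state": {
        "x": {"start": 0, "end": 1},
        "y": {"start": 1, "end": 2},
        # ... one 1-D slice per state key, through "pad" and "gripper"
    },
    "action": {
        "x": {"start": 0, "end": 1},
        # ... one 1-D slice per action key, through "gripper"
    },
    "video": {
        "image_0": {"original_key": "observation.images.image_0"},  # assumed source key
    },
    "annotation": {
        "human.action.task_description": {},
    },
}
```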

An extra embodiment tag is needed to reproduce the results:


class EmbodimentTag(Enum):
    OXE = 'oxe'

# Map embodiment tag strings to projector indices in the action expert module.
EMBODIMENT_TAG_MAPPING = {
    EmbodimentTag.OXE.value: 7,
}
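
To tie the pieces together, the new config and tag can be exercised roughly as follows. `DATA_CONFIG_MAP`, `LeRobotSingleDataset`, and `EmbodimentTag` exist in the Isaac-GR00T codebase, but the map key "bridge", the dataset path, and the exact keyword names are assumptions based on the upstream getting-started examples, so adjust them to the pinned commit.

```python
from gr00t.data.dataset import LeRobotSingleDataset
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.experiment.data_config import DATA_CONFIG_MAP, BridgeDataConfig

# Register the Bridge config under a key of your choice ("bridge" is arbitrary);
# this can also be done directly inside DATA_CONFIG_MAP in data_config.py.
DATA_CONFIG_MAP["bridge"] = BridgeDataConfig()
data_config = DATA_CONFIG_MAP["bridge"]

# Build a dataset that uses the new config and the extra OXE embodiment tag.
dataset = LeRobotSingleDataset(
    dataset_path="/path/to/bridge_orig_lerobot",      # placeholder path
    modality_configs=data_config.modality_config(),
    transforms=data_config.transform(),
    embodiment_tag=EmbodimentTag.OXE,
)
```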

Thanks to @youliangtan, who reevaluated my results.

https://huggingface.co/ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-BridgeV2/discussions/1

https://github.com/NVIDIA/Isaac-GR00T at commit aa6441feb4f08233d55cbfd2082753cdc01fa676

With the modified SimplerEnv: https://github.com/youliangtan/SimplerEnv
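
For evaluation, loading the checkpoint as a policy might look roughly like the sketch below; `Gr00tPolicy` and `get_action` follow the upstream Isaac-GR00T examples, and the exact argument names are assumptions to check against the pinned commit.

```python
from gr00t.experiment.data_config import DATA_CONFIG_MAP
from gr00t.model.policy import Gr00tPolicy

data_config = DATA_CONFIG_MAP["bridge"]  # registered as sketched above

policy = Gr00tPolicy(
    model_path="ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-BridgeV2",
    modality_config=data_config.modality_config(),
    modality_transform=data_config.transform(),
    embodiment_tag="oxe",   # the extra tag added above
    device="cuda",
)

# obs is a dict keyed by the modality names above ("video.image_0", "state.x", ...);
# the returned chunk contains the predicted action horizon.
# action_chunk = policy.get_action(obs)
```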
