This is a GR00T model fine-tuned on the Fractal dataset (60k steps, 8 A100 GPUs) with the default fine-tuning settings (i.e., the VLM backbone frozen). The evaluation was conducted with the SimplerEnv-OpenVLA repository (https://github.com/DelinQu/SimplerEnv-OpenVLA); thanks to its maintainers for their contribution to the community. This fine-tuned model should not be considered representative of GR00T's actual performance.
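For reference, the "default fine-tuning settings" freeze the VLM backbone while (as I understand the repo defaults) training the embodiment projectors and the action expert. A minimal sketch of the corresponding model loading; the `GR00T_N1_5` class and tune flags are my assumptions about Isaac-GR00T's `gr00t_finetune.py`, so verify them against the pinned commit below:

```python
# Hedged sketch: load GR00T N1.5 with the VLM backbone frozen (default
# fine-tuning configuration). Flag names are assumptions based on
# Isaac-GR00T's gr00t_finetune.py; verify against the pinned commit.
from gr00t.model.gr00t_n1 import GR00T_N1_5

model = GR00T_N1_5.from_pretrained(
    pretrained_model_name_or_path="nvidia/GR00T-N1.5-3B",
    tune_llm=False,              # language backbone frozen
    tune_visual=False,           # vision tower frozen
    tune_projector=True,         # embodiment projectors trained
    tune_diffusion_model=True,   # action expert trained
)
```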
SimplerEnv evaluation results (success rates; nan indicates no reported result):

| Task | GR00T-N1.5 | RT-1 (Converged) | RT-1 (15%) | RT-1-X | RT-2-X | Octo-Base | Octo-Small | RT-1 (begin) | OpenVLA | RoboVLM |
|---|---|---|---|---|---|---|---|---|---|---|
| coke_can/matching_avg | 0.517 | 0.857 | 0.71 | 0.567 | 0.787 | 0.17 | nan | 0.027 | 0.163 | 0.727 |
| coke_can/variant_avg | 0.636 | 0.898 | 0.813 | 0.49 | 0.823 | 0.006 | nan | 0.022 | 0.545 | nan |
| coke_can/matching/horizontal | 0.47 | 0.96 | 0.86 | 0.82 | 0.74 | 0.21 | nan | 0.05 | 0.27 | 0.85 |
| coke_can/matching/vertical | 0.13 | 0.9 | 0.79 | 0.33 | 0.74 | 0.21 | nan | 0.0 | 0.03 | 0.43 |
| coke_can/matching/standing | 0.95 | 0.71 | 0.48 | 0.55 | 0.88 | 0.09 | nan | 0.03 | 0.19 | 0.9 |
| coke_can/variant/horizontal | 0.711 | 0.969 | 0.92 | 0.569 | 0.822 | 0.005 | nan | 0.022 | 0.711 | nan |
| coke_can/variant/vertical | 0.324 | 0.76 | 0.704 | 0.204 | 0.754 | 0.0 | nan | 0.013 | 0.271 | nan |
| coke_can/variant/standing | 0.871 | 0.964 | 0.813 | 0.698 | 0.893 | 0.013 | nan | 0.031 | 0.653 | nan |
| move_near/variant | 0.51 | 0.5 | 0.446 | 0.323 | 0.792 | 0.031 | nan | 0.04 | 0.477 | nan |
| move_near/matching | 0.54 | 0.442 | 0.354 | 0.317 | 0.779 | 0.042 | nan | 0.05 | 0.462 | 0.663 |
| drawer/matching_avg | 0.278 | 0.73 | 0.565 | 0.597 | 0.25 | 0.227 | nan | 0.139 | 0.356 | 0.268 |
| drawer/variant_avg | 0.132 | 0.323 | 0.267 | 0.294 | 0.353 | 0.011 | nan | 0.069 | 0.177 | nan |
| drawer/matching/open | 0.269 | 0.601 | 0.463 | 0.296 | 0.157 | 0.009 | nan | 0.0 | 0.194 | 0.287 |
| drawer/matching/close | 0.287 | 0.861 | 0.667 | 0.891 | 0.343 | 0.444 | nan | 0.278 | 0.518 | 0.25 |
| drawer/variant/open | 0.085 | 0.27 | 0.212 | 0.069 | 0.333 | 0.0 | nan | 0.005 | 0.158 | nan |
| drawer/variant/close | 0.180 | 0.376 | 0.323 | 0.519 | 0.372 | 0.021 | nan | 0.132 | 0.195 | nan |
| apple_in_drawer/matching_avg | 0.074 | 0.065 | 0.13 | 0.213 | 0.037 | 0.0 | 0.0 | 0.0 | nan | 0.361 |
| apple_in_drawer/variant_avg | 0.023 | 0.026 | 0.021 | 0.101 | 0.206 | 0.0 | 0.0 | 0.0 | nan | nan |
Data configuration: in addition to adding the following code to `data_config.py`, I also provide the `modality.json`, which is required by the GR00T dataloader.
```python
# Added to gr00t/experiment/data_config.py, which already imports the
# transforms used below (VideoToTensor, StateActionTransform, etc.).

class FractalDataConfig(So100DataConfig):
    video_keys = ["video.image"]
    state_keys = ["state.x", "state.y", "state.z", "state.rx", "state.ry", "state.rz", "state.rw", "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]

    def transform(self) -> ModalityTransform:
        transforms = [
            # video transforms
            VideoToTensor(apply_to=self.video_keys),
            VideoCrop(apply_to=self.video_keys, scale=0.95),
            VideoResize(apply_to=self.video_keys, height=224, width=224, interpolation="linear"),
            VideoColorJitter(
                apply_to=self.video_keys,
                brightness=0.3,
                contrast=0.4,
                saturation=0.5,
                hue=0.08,
            ),
            VideoToNumpy(apply_to=self.video_keys),
            # state transforms
            StateActionToTensor(apply_to=self.state_keys),
            StateActionTransform(
                apply_to=self.state_keys,
                normalization_modes={key: "min_max" for key in self.state_keys},
            ),
            # action transforms
            StateActionToTensor(apply_to=self.action_keys),
            StateActionTransform(
                apply_to=self.action_keys,
                normalization_modes={key: "min_max" for key in self.action_keys},
            ),
            # concat transforms
            ConcatTransform(
                video_concat_order=self.video_keys,
                state_concat_order=self.state_keys,
                action_concat_order=self.action_keys,
            ),
            # model-specific transform
            GR00TTransform(
                state_horizon=len(self.observation_indices),
                action_horizon=len(self.action_indices),
                max_state_dim=64,
                max_action_dim=32,
            ),
        ]
        return ComposedModalityTransform(transforms=transforms)


class BridgeDataConfig(FractalDataConfig):
    # Bridge uses a different camera key and an euler-angle state with padding.
    video_keys = ["video.image_0"]
    state_keys = ["state.x", "state.y", "state.z", "state.roll", "state.pitch", "state.yaw", "state.pad", "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]
```
An extra embodiment tag is also needed to reproduce the results; the following entries extend GR00T's existing `EmbodimentTag` enum and `EMBODIMENT_TAG_MAPPING`:
```python
# Added to the existing EmbodimentTag enum:
class EmbodimentTag(Enum):
    OXE = "oxe"

# Embodiment tag string -> projector index in the Action Expert module:
EMBODIMENT_TAG_MAPPING = {
    EmbodimentTag.OXE.value: 7,
}
```
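The projector index routes `oxe` samples to a dedicated projector in the Action Expert module, so the chosen index should not collide with indices already assigned to other embodiment tags. A trivial sanity check, assuming the additions above are applied in `gr00t/data/embodiment_tags.py` (module path is my assumption):

```python
from gr00t.data.embodiment_tags import EMBODIMENT_TAG_MAPPING, EmbodimentTag

# The new tag should round-trip from its string and resolve to projector 7.
assert EmbodimentTag("oxe") is EmbodimentTag.OXE
assert EMBODIMENT_TAG_MAPPING[EmbodimentTag.OXE.value] == 7
```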
Thanks to @youliangtan, who re-evaluated my results using:

- https://github.com/NVIDIA/Isaac-GR00T at commit `aa6441feb4f08233d55cbfd2082753cdc01fa676`
- the modified SimplerEnv: https://github.com/youliangtan/SimplerEnv
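For completeness, a hedged sketch of loading this checkpoint for a SimplerEnv-style rollout; `Gr00tPolicy` and its argument names reflect my understanding of Isaac-GR00T at the commit above, and `"fractal"` is again a hypothetical config key:

```python
# Hedged inference sketch (verify names against the pinned Isaac-GR00T commit).
from gr00t.experiment.data_config import DATA_CONFIG_MAP
from gr00t.model.policy import Gr00tPolicy

data_config = DATA_CONFIG_MAP["fractal"]  # hypothetical key for FractalDataConfig
policy = Gr00tPolicy(
    model_path="ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-Fractal",
    embodiment_tag="oxe",
    modality_config=data_config.modality_config(),
    modality_transform=data_config.transform(),
    device="cuda",
)
# obs: dict matching the video/state/language keys in FractalDataConfig
# action_chunk = policy.get_action(obs)
```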
Base model: nvidia/GR00T-N1.5-3B