This is a GR00T model fine-tuned on the Fractal dataset (60k steps, 8 A100 GPUs) using the default fine-tuning settings (i.e., with the VLM backbone frozen).
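
For reference, a run with these settings can be launched roughly like this. This is a sketch assuming the `scripts/gr00t_finetune.py` entry point from Isaac-GR00T; exact flag names may differ across commits, and the dataset path plus the `fractal`/`oxe` identifiers refer to the data config and embodiment tag described further below:

```bash
# Sketch only: flags follow the Isaac-GR00T fine-tuning script and may
# differ at other commits; the VLM backbone stays frozen by default.
python scripts/gr00t_finetune.py \
    --dataset-path /path/to/fractal_lerobot \
    --data-config fractal \
    --embodiment-tag oxe \
    --num-gpus 8 \
    --max-steps 60000
```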

The evaluation was conducted using the SimplerEnv-OpenVLA repository (https://github.com/DelinQu/SimplerEnv-OpenVLA); thanks to the authors for their contributions to the community.

This fine-tuned model should not be considered representative of GR00T's actual performance.

SimplerEnv evaluation results (success rates; `nan` = not evaluated):

| ckpt_name | GR00T-N1.5 | RT-1(Converged) | RT-1(15%) | RT-1-X | RT-2-X | Octo-Base | Octo-Small | RT-1(begin) | OpenVLA | RoboVLM |
|---|---|---|---|---|---|---|---|---|---|---|
| coke_can/matching_avg | 0.517 | 0.857 | 0.710 | 0.567 | 0.787 | 0.170 | nan | 0.027 | 0.163 | 0.727 |
| coke_can/variant_avg | 0.636 | 0.898 | 0.813 | 0.490 | 0.823 | 0.006 | nan | 0.022 | 0.545 | nan |
| coke_can/matching/horizontal | 0.470 | 0.960 | 0.860 | 0.820 | 0.740 | 0.210 | nan | 0.050 | 0.270 | 0.850 |
| coke_can/matching/vertical | 0.130 | 0.900 | 0.790 | 0.330 | 0.740 | 0.210 | nan | 0.000 | 0.030 | 0.430 |
| coke_can/matching/standing | 0.950 | 0.710 | 0.480 | 0.550 | 0.880 | 0.090 | nan | 0.030 | 0.190 | 0.900 |
| coke_can/variant/horizontal | 0.711 | 0.969 | 0.920 | 0.569 | 0.822 | 0.005 | nan | 0.022 | 0.711 | nan |
| coke_can/variant/vertical | 0.324 | 0.760 | 0.704 | 0.204 | 0.754 | 0.000 | nan | 0.013 | 0.271 | nan |
| coke_can/variant/standing | 0.871 | 0.964 | 0.813 | 0.698 | 0.893 | 0.013 | nan | 0.031 | 0.653 | nan |
| move_near/variant | 0.510 | 0.500 | 0.446 | 0.323 | 0.792 | 0.031 | nan | 0.040 | 0.477 | nan |
| move_near/matching | 0.540 | 0.442 | 0.354 | 0.317 | 0.779 | 0.042 | nan | 0.050 | 0.462 | 0.663 |
| drawer/matching_avg | 0.278 | 0.730 | 0.565 | 0.597 | 0.250 | 0.227 | nan | 0.139 | 0.356 | 0.268 |
| drawer/variant_avg | 0.132 | 0.323 | 0.267 | 0.294 | 0.353 | 0.011 | nan | 0.069 | 0.177 | nan |
| drawer/matching/open | 0.269 | 0.601 | 0.463 | 0.296 | 0.157 | 0.009 | nan | 0.000 | 0.194 | 0.287 |
| drawer/matching/close | 0.287 | 0.861 | 0.667 | 0.891 | 0.343 | 0.444 | nan | 0.278 | 0.518 | 0.250 |
| drawer/variant/open | 0.085 | 0.270 | 0.212 | 0.069 | 0.333 | 0.000 | nan | 0.005 | 0.158 | nan |
| drawer/variant/close | 0.180 | 0.376 | 0.323 | 0.519 | 0.372 | 0.021 | nan | 0.132 | 0.195 | nan |
| apple_in_drawer/matching_avg | 0.074 | 0.065 | 0.130 | 0.213 | 0.037 | 0.000 | 0.000 | 0.000 | nan | 0.361 |
| apple_in_drawer/variant_avg | 0.023 | 0.026 | 0.021 | 0.101 | 0.206 | 0.000 | 0.000 | 0.000 | nan | nan |
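
As a quick sanity check on the table, each `matching_avg` entry is the mean of its three sub-settings; e.g. for GR00T-N1.5 on coke_can:

```python
# coke_can/matching sub-results for GR00T-N1.5, read off the table above
horizontal, vertical, standing = 0.47, 0.13, 0.95

matching_avg = (horizontal + vertical + standing) / 3
print(round(matching_avg, 3))  # 0.517 == coke_can/matching_avg
```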

Data configuration:

In addition to the code added to data_config.py (shown below), I also provide the modality.json required by the GR00T dataloader; its layout is sketched next.
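
A minimal sketch of the modality.json layout, assuming the standard Isaac-GR00T format (per-key `start`/`end` slices into the flat state/action vectors); the `original_key` value is an assumption here and depends on how the dataset was converted:

```python
import json

# Sketch of the modality.json layout matching the Fractal keys below.
modality = {
    "state": {
        "x": {"start": 0, "end": 1},
        "y": {"start": 1, "end": 2},
        "z": {"start": 2, "end": 3},
        "rx": {"start": 3, "end": 4},
        "ry": {"start": 4, "end": 5},
        "rz": {"start": 5, "end": 6},
        "rw": {"start": 6, "end": 7},
        "gripper": {"start": 7, "end": 8},
    },
    "action": {
        "x": {"start": 0, "end": 1},
        "y": {"start": 1, "end": 2},
        "z": {"start": 2, "end": 3},
        "roll": {"start": 3, "end": 4},
        "pitch": {"start": 4, "end": 5},
        "yaw": {"start": 5, "end": 6},
        "gripper": {"start": 6, "end": 7},
    },
    # "original_key" is an assumption; it must match the converted dataset.
    "video": {"image": {"original_key": "observation.images.image"}},
    "annotation": {"human.action.task_description": {}},
}

# Place the file next to the dataset's other metadata (meta/modality.json).
with open("modality.json", "w") as f:
    json.dump(modality, f, indent=4)
```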


```python
# Added to data_config.py; the transform classes used here are already
# imported in that file.
class FractalDataConfig(So100DataConfig):
    video_keys = ["video.image"]
    state_keys = [
        "state.x", "state.y", "state.z",
        "state.rx", "state.ry", "state.rz", "state.rw",
        "state.gripper",
    ]
    action_keys = [
        "action.x", "action.y", "action.z",
        "action.roll", "action.pitch", "action.yaw",
        "action.gripper",
    ]
    language_keys = ["annotation.human.action.task_description"]

    def transform(self) -> ModalityTransform:
        transforms = [
            # video transforms
            VideoToTensor(apply_to=self.video_keys),
            VideoCrop(apply_to=self.video_keys, scale=0.95),
            VideoResize(apply_to=self.video_keys, height=224, width=224, interpolation="linear"),
            VideoColorJitter(
                apply_to=self.video_keys,
                brightness=0.3,
                contrast=0.4,
                saturation=0.5,
                hue=0.08,
            ),
            VideoToNumpy(apply_to=self.video_keys),
            # state transforms: min-max normalization per state key
            StateActionToTensor(apply_to=self.state_keys),
            StateActionTransform(
                apply_to=self.state_keys,
                normalization_modes={key: "min_max" for key in self.state_keys},
            ),
            # action transforms: min-max normalization per action key
            StateActionToTensor(apply_to=self.action_keys),
            StateActionTransform(
                apply_to=self.action_keys,
                normalization_modes={key: "min_max" for key in self.action_keys},
            ),
            # concatenate per-key tensors into flat video/state/action arrays
            ConcatTransform(
                video_concat_order=self.video_keys,
                state_concat_order=self.state_keys,
                action_concat_order=self.action_keys,
            ),
            # model-specific transform: pad to the model's max state/action dims
            GR00TTransform(
                state_horizon=len(self.observation_indices),
                action_horizon=len(self.action_indices),
                max_state_dim=64,
                max_action_dim=32,
            ),
        ]
        return ComposedModalityTransform(transforms=transforms)
```


```python
class BridgeDataConfig(FractalDataConfig):
    video_keys = ["video.image_0"]
    state_keys = [
        "state.x", "state.y", "state.z",
        "state.roll", "state.pitch", "state.yaw", "state.pad",
        "state.gripper",
    ]
    action_keys = [
        "action.x", "action.y", "action.z",
        "action.roll", "action.pitch", "action.yaw",
        "action.gripper",
    ]
    language_keys = ["annotation.human.action.task_description"]
```

An extra embodiment tag is required to reproduce the results:

```python
# New member added to the existing EmbodimentTag enum
class EmbodimentTag(Enum):
    OXE = "oxe"


# Maps each embodiment tag string to its projector index in the action expert module
EMBODIMENT_TAG_MAPPING = {
    EmbodimentTag.OXE.value: 7,
}
```
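
With the tag registered, the checkpoint can be loaded for evaluation roughly as follows. This is a sketch assuming the `Gr00tPolicy` interface from Isaac-GR00T at the commit below and the `"fractal"` registration shown earlier; argument names may differ across commits:

```python
from gr00t.experiment.data_config import DATA_CONFIG_MAP
from gr00t.model.policy import Gr00tPolicy

# Sketch: load the fine-tuned checkpoint with the new "oxe" embodiment tag.
# "fractal" assumes the DATA_CONFIG_MAP registration shown above.
data_config = DATA_CONFIG_MAP["fractal"]

policy = Gr00tPolicy(
    model_path="ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-Fractal",
    embodiment_tag="oxe",
    modality_config=data_config.modality_config(),
    modality_transform=data_config.transform(),
    device="cuda",
)
```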

Thanks to @youliangtan, who re-evaluated my results:

https://huggingface.co/ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-BridgeV2/discussions/1

The re-evaluation used https://github.com/NVIDIA/Isaac-GR00T at commit aa6441feb4f08233d55cbfd2082753cdc01fa676, together with the modified SimplerEnv at https://github.com/youliangtan/SimplerEnv.
