This is a GR00T model fine-tuned on the Fractal dataset (60k steps, 8 A100 GPUs) with the default fine-tuning settings (i.e., the VLM backbone frozen). The evaluation was conducted with the SimplerEnv-OpenVLA repository (https://github.com/DelinQu/SimplerEnv-OpenVLA); thanks to its maintainers for their contribution to the community. This fine-tuned model should not be considered representative of GR00T's actual performance.
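For reference, the "default fine-tuning settings" freeze the VLM backbone while (as I understand the repo defaults) training the embodiment projectors and the action expert. A minimal sketch of the corresponding model loading; the `GR00T_N1_5` class and tune flags are my assumptions about Isaac-GR00T's `gr00t_finetune.py`, so verify them against the pinned commit below:

```python
# Hedged sketch: load GR00T N1.5 with the VLM backbone frozen (default
# fine-tuning configuration). Flag names are assumptions based on
# Isaac-GR00T's gr00t_finetune.py; verify against the pinned commit.
from gr00t.model.gr00t_n1 import GR00T_N1_5

model = GR00T_N1_5.from_pretrained(
    pretrained_model_name_or_path="nvidia/GR00T-N1.5-3B",
    tune_llm=False,              # language backbone frozen
    tune_visual=False,           # vision tower frozen
    tune_projector=True,         # embodiment projectors trained
    tune_diffusion_model=True,   # action expert trained
)
```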
SimplerEnv evaluation results (success rates; nan indicates no reported result):

| Task | GR00T-N1.5 | RT-1 (Converged) | RT-1 (15%) | RT-1-X | RT-2-X | Octo-Base | Octo-Small | RT-1 (begin) | OpenVLA | RoboVLM |
|---|---|---|---|---|---|---|---|---|---|---|
| coke_can/matching_avg | 0.517 | 0.857 | 0.71 | 0.567 | 0.787 | 0.17 | nan | 0.027 | 0.163 | 0.727 |
| coke_can/variant_avg | 0.636 | 0.898 | 0.813 | 0.49 | 0.823 | 0.006 | nan | 0.022 | 0.545 | nan |
| coke_can/matching/horizontal | 0.47 | 0.96 | 0.86 | 0.82 | 0.74 | 0.21 | nan | 0.05 | 0.27 | 0.85 |
| coke_can/matching/vertical | 0.13 | 0.9 | 0.79 | 0.33 | 0.74 | 0.21 | nan | 0.0 | 0.03 | 0.43 |
| coke_can/matching/standing | 0.95 | 0.71 | 0.48 | 0.55 | 0.88 | 0.09 | nan | 0.03 | 0.19 | 0.9 |
| coke_can/variant/horizontal | 0.711 | 0.969 | 0.92 | 0.569 | 0.822 | 0.005 | nan | 0.022 | 0.711 | nan |
| coke_can/variant/vertical | 0.324 | 0.76 | 0.704 | 0.204 | 0.754 | 0.0 | nan | 0.013 | 0.271 | nan |
| coke_can/variant/standing | 0.871 | 0.964 | 0.813 | 0.698 | 0.893 | 0.013 | nan | 0.031 | 0.653 | nan |
| move_near/variant | 0.51 | 0.5 | 0.446 | 0.323 | 0.792 | 0.031 | nan | 0.04 | 0.477 | nan |
| move_near/matching | 0.54 | 0.442 | 0.354 | 0.317 | 0.779 | 0.042 | nan | 0.05 | 0.462 | 0.663 |
| drawer/matching_avg | 0.278 | 0.73 | 0.565 | 0.597 | 0.25 | 0.227 | nan | 0.139 | 0.356 | 0.268 |
| drawer/variant_avg | 0.132 | 0.323 | 0.267 | 0.294 | 0.353 | 0.011 | nan | 0.069 | 0.177 | nan |
| drawer/matching/open | 0.269 | 0.601 | 0.463 | 0.296 | 0.157 | 0.009 | nan | 0.0 | 0.194 | 0.287 |
| drawer/matching/close | 0.287 | 0.861 | 0.667 | 0.891 | 0.343 | 0.444 | nan | 0.278 | 0.518 | 0.25 |
| drawer/variant/open | 0.085 | 0.27 | 0.212 | 0.069 | 0.333 | 0.0 | nan | 0.005 | 0.158 | nan |
| drawer/variant/close | 0.180 | 0.376 | 0.323 | 0.519 | 0.372 | 0.021 | nan | 0.132 | 0.195 | nan |
| apple_in_drawer/matching_avg | 0.074 | 0.065 | 0.13 | 0.213 | 0.037 | 0.0 | 0.0 | 0.0 | nan | 0.361 |
| apple_in_drawer/variant_avg | 0.023 | 0.026 | 0.021 | 0.101 | 0.206 | 0.0 | 0.0 | 0.0 | nan | nan |
Data configuration: in addition to adding the following code to `data_config.py`, I also provide the `modality.json`, which is required by the GR00T dataloader.
```python
# Added to gr00t/experiment/data_config.py, which already imports the
# transforms used below (VideoToTensor, StateActionTransform, etc.).

class FractalDataConfig(So100DataConfig):
    video_keys = ["video.image"]
    state_keys = ["state.x", "state.y", "state.z", "state.rx", "state.ry", "state.rz", "state.rw", "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]

    def transform(self) -> ModalityTransform:
        transforms = [
            # video transforms
            VideoToTensor(apply_to=self.video_keys),
            VideoCrop(apply_to=self.video_keys, scale=0.95),
            VideoResize(apply_to=self.video_keys, height=224, width=224, interpolation="linear"),
            VideoColorJitter(
                apply_to=self.video_keys,
                brightness=0.3,
                contrast=0.4,
                saturation=0.5,
                hue=0.08,
            ),
            VideoToNumpy(apply_to=self.video_keys),
            # state transforms
            StateActionToTensor(apply_to=self.state_keys),
            StateActionTransform(
                apply_to=self.state_keys,
                normalization_modes={key: "min_max" for key in self.state_keys},
            ),
            # action transforms
            StateActionToTensor(apply_to=self.action_keys),
            StateActionTransform(
                apply_to=self.action_keys,
                normalization_modes={key: "min_max" for key in self.action_keys},
            ),
            # concat transforms
            ConcatTransform(
                video_concat_order=self.video_keys,
                state_concat_order=self.state_keys,
                action_concat_order=self.action_keys,
            ),
            # model-specific transform
            GR00TTransform(
                state_horizon=len(self.observation_indices),
                action_horizon=len(self.action_indices),
                max_state_dim=64,
                max_action_dim=32,
            ),
        ]
        return ComposedModalityTransform(transforms=transforms)


class BridgeDataConfig(FractalDataConfig):
    # Bridge uses a different camera key and an euler-angle state with padding.
    video_keys = ["video.image_0"]
    state_keys = ["state.x", "state.y", "state.z", "state.roll", "state.pitch", "state.yaw", "state.pad", "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]
```
An extra embodiment tag is also needed to reproduce the results; the following entries extend GR00T's existing `EmbodimentTag` enum and `EMBODIMENT_TAG_MAPPING`:
```python
# Added to the existing EmbodimentTag enum:
class EmbodimentTag(Enum):
    OXE = "oxe"

# Embodiment tag string -> projector index in the Action Expert module:
EMBODIMENT_TAG_MAPPING = {
    EmbodimentTag.OXE.value: 7,
}
```
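The projector index routes `oxe` samples to a dedicated projector in the Action Expert module, so the chosen index should not collide with indices already assigned to other embodiment tags. A trivial sanity check, assuming the additions above are applied in `gr00t/data/embodiment_tags.py` (module path is my assumption):

```python
from gr00t.data.embodiment_tags import EMBODIMENT_TAG_MAPPING, EmbodimentTag

# The new tag should round-trip from its string and resolve to projector 7.
assert EmbodimentTag("oxe") is EmbodimentTag.OXE
assert EMBODIMENT_TAG_MAPPING[EmbodimentTag.OXE.value] == 7
```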
Thanks to @youliangtan, who re-evaluated my results using:

- https://github.com/NVIDIA/Isaac-GR00T at commit `aa6441feb4f08233d55cbfd2082753cdc01fa676`
- the modified SimplerEnv: https://github.com/youliangtan/SimplerEnv
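For completeness, a hedged sketch of loading this checkpoint for a SimplerEnv-style rollout; `Gr00tPolicy` and its argument names reflect my understanding of Isaac-GR00T at the commit above, and `"fractal"` is again a hypothetical config key:

```python
# Hedged inference sketch (verify names against the pinned Isaac-GR00T commit).
from gr00t.experiment.data_config import DATA_CONFIG_MAP
from gr00t.model.policy import Gr00tPolicy

data_config = DATA_CONFIG_MAP["fractal"]  # hypothetical key for FractalDataConfig
policy = Gr00tPolicy(
    model_path="ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-Fractal",
    embodiment_tag="oxe",
    modality_config=data_config.modality_config(),
    modality_transform=data_config.transform(),
    device="cuda",
)
# obs: dict matching the video/state/language keys in FractalDataConfig
# action_chunk = policy.get_action(obs)
```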
Base model: nvidia/GR00T-N1.5-3B