---
datasets:
- IPEC-COMMUNITY/bridge_orig_lerobot
base_model:
- nvidia/GR00T-N1.5-3B
---

This is a GR00T model fine-tuned on the Bridge dataset (30k steps, 8 A100 GPUs) using the default fine-tuning settings (i.e., the VLM backbone frozen).

The evaluation was conducted with the SimplerEnv-OpenVLA repository (https://github.com/DelinQu/SimplerEnv-OpenVLA), with thanks to the authors for their contribution to the community.

This fine-tuned model should not be considered representative of GR00T's actual performance.

| ckpt_name | GR00T-N1.5 | RT-1(Converged) | RT-1(15%) | RT-1-X | RT-2-X | Octo-Base | Octo-Small | RT-1(begin) | OpenVLA | RoboVLM |
|---|---|---|---|---|---|---|---|---|---|---|
| put_spoon_on_tablecloth/matching_partial | 0.833 | nan | nan | 0.167 | nan | 0.347 | 0.778 | nan | 0.041 | 0.375 |
| put_spoon_on_tablecloth/matching_entire | 0.625 | nan | nan | 0.0 | nan | 0.125 | 0.472 | nan | 0.0 | 0.208 |
| put_carrot_on_plate/matching_partial | 0.542 | nan | nan | 0.208 | nan | 0.528 | 0.278 | nan | 0.333 | 0.333 |
| put_carrot_on_plate/matching_entire | 0.458 | nan | nan | 0.042 | nan | 0.083 | 0.097 | nan | 0.0 | 0.25 |
| stack_green_block_on_yellow_block/matching_partial | 0.708 | nan | nan | 0.083 | nan | 0.319 | 0.403 | nan | 0.125 | 0.083 |
| stack_green_block_on_yellow_block/matching_entire | 0.167 | nan | nan | 0.0 | nan | 0.0 | 0.042 | nan | 0.0 | 0.083 |
| put_eggplant_in_basket/matching_partial | 0.417 | nan | nan | 0.0 | nan | 0.667 | 0.875 | nan | 0.083 | 0.0 |
| put_eggplant_in_basket/matching_entire | 0.208 | nan | nan | 0.0 | nan | 0.431 | 0.569 | nan | 0.041 | 0.0 |

Data configuration:

In addition to adding the following code to data_config.py, I also provide the modality.json required by the GR00T dataloader (a rough sketch of its layout follows the config classes below).


class FractalDataConfig(So100DataConfig):
    video_keys = ["video.image", ]
    state_keys = ["state.x", "state.y", "state.z", "state.rx", "state.ry", "state.rz", "state.rw",  "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]

    def transform(self) -> ModalityTransform:
        transforms = [
            # video transforms
            VideoToTensor(apply_to=self.video_keys),
            VideoCrop(apply_to=self.video_keys, scale=0.95),
            VideoResize(apply_to=self.video_keys, height=224, width=224, interpolation="linear"),
            VideoColorJitter(
                apply_to=self.video_keys,
                brightness=0.3,
                contrast=0.4,
                saturation=0.5,
                hue=0.08,
            ),
            VideoToNumpy(apply_to=self.video_keys),
            # state transforms
            StateActionToTensor(apply_to=self.state_keys),
            StateActionTransform(
                apply_to=self.state_keys,
                normalization_modes={key: "min_max" for key in self.state_keys},
            ),
            # action transforms
            StateActionToTensor(apply_to=self.action_keys),
            StateActionTransform(
                apply_to=self.action_keys,
                normalization_modes={key: "min_max" for key in self.action_keys},
            ),
            # concat transforms
            ConcatTransform(
                video_concat_order=self.video_keys,
                state_concat_order=self.state_keys,
                action_concat_order=self.action_keys,
            ),
            # model-specific transform
            GR00TTransform(
                state_horizon=len(self.observation_indices),
                action_horizon=len(self.action_indices),
                max_state_dim=64,
                max_action_dim=32,
            ),
        ]
        return ComposedModalityTransform(transforms=transforms)


class BridgeDataConfig(FractalDataConfig):
    video_keys = ["video.image_0", ]
    state_keys = ["state.x", "state.y", "state.z", "state.roll", "state.pitch", "state.yaw", "state.pad",  "state.gripper"]
    action_keys = ["action.x", "action.y", "action.z", "action.roll", "action.pitch", "action.yaw", "action.gripper"]
    language_keys = ["annotation.human.action.task_description"]
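
For reference, the GR00T LeRobot schema expects modality.json to map each state/action key to a slice of the flattened state/action vector and each video key back to its original dataset key. The sketch below only illustrates that layout for the Bridge keys above; the index ranges and the `original_key` value are assumptions, so treat the modality.json shipped with this repository as the source of truth.

```python
# Illustrative sketch only -- use the modality.json provided with this model.
# Each state/action entry names a [start, end) slice of the flattened vector;
# each video entry points back to the original dataset key.
bridge_modality_sketch = {
    "state": {
        "x": {"start": 0, "end": 1},
        "y": {"start": 1, "end": 2},
        # ... one 1-D slice per state key, through "pad" and "gripper"
    },
    "action": {
        "x": {"start": 0, "end": 1},
        # ... one 1-D slice per action key, through "gripper"
    },
    "video": {
        "image_0": {"original_key": "observation.images.image_0"},  # assumed source key
    },
    "annotation": {
        "human.action.task_description": {},
    },
}
```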

An extra embodiment tag is needed to reproduce the results:


class EmbodimentTag(Enum):
    OXE = 'oxe'

# Map embodiment tag strings to projector indices in the action expert module.
EMBODIMENT_TAG_MAPPING = {
    EmbodimentTag.OXE.value: 7,
}
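
To tie the pieces together, the new config and tag can be exercised roughly as follows. `DATA_CONFIG_MAP`, `LeRobotSingleDataset`, and `EmbodimentTag` exist in the Isaac-GR00T codebase, but the map key "bridge", the dataset path, and the exact keyword names are assumptions based on the upstream getting-started examples, so adjust them to the pinned commit.

```python
from gr00t.data.dataset import LeRobotSingleDataset
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.experiment.data_config import DATA_CONFIG_MAP, BridgeDataConfig

# Register the Bridge config under a key of your choice ("bridge" is arbitrary);
# this can also be done directly inside DATA_CONFIG_MAP in data_config.py.
DATA_CONFIG_MAP["bridge"] = BridgeDataConfig()
data_config = DATA_CONFIG_MAP["bridge"]

# Build a dataset that uses the new config and the extra OXE embodiment tag.
dataset = LeRobotSingleDataset(
    dataset_path="/path/to/bridge_orig_lerobot",      # placeholder path
    modality_configs=data_config.modality_config(),
    transforms=data_config.transform(),
    embodiment_tag=EmbodimentTag.OXE,
)
```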

Thanks to @youliangtan, who reevaluated my results.

https://huggingface.co/ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-BridgeV2/discussions/1

https://github.com/NVIDIA/Isaac-GR00T at commit aa6441feb4f08233d55cbfd2082753cdc01fa676

With the modified SimplerEnv: https://github.com/youliangtan/SimplerEnv
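
For evaluation, loading the checkpoint as a policy might look roughly like the sketch below; `Gr00tPolicy` and `get_action` follow the upstream Isaac-GR00T examples, and the exact argument names are assumptions to check against the pinned commit.

```python
from gr00t.experiment.data_config import DATA_CONFIG_MAP
from gr00t.model.policy import Gr00tPolicy

data_config = DATA_CONFIG_MAP["bridge"]  # registered as sketched above

policy = Gr00tPolicy(
    model_path="ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-BridgeV2",
    modality_config=data_config.modality_config(),
    modality_transform=data_config.transform(),
    embodiment_tag="oxe",   # the extra tag added above
    device="cuda",
)

# obs is a dict keyed by the modality names above ("video.image_0", "state.x", ...);
# the returned chunk contains the predicted action horizon.
# action_chunk = policy.get_action(obs)
```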
