CLIP-RT Finetuned on LIBERO-Spatial
We finetune the original CLIP-RT model with a 300M-parameter action decoder to enable continuous action prediction. This checkpoint is the model finetuned on the LIBERO-Spatial task suite.
Hyperparameters
Category | Details |
---|---|
Train | 8 × H100 GPUs, each with 80GB VRAM (batch size: 256) |
Model size | 1.3B (CLIP-RT base + 0.3B action decoder) |
Action dimension | 7D end-effector action × 8 action chunks |
Loss | L1 regression |
Epochs | 128 |
Performance | 95.2% success rate on the LIBERO-Spatial task suite |
Throughput | 163Hz |
Inference | One GPU with 9GB VRAM |
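
To make the "Action dimension" and "Loss" rows concrete, here is a minimal sketch of an L1 regression loss over one action chunk of 8 steps × 7 dimensions. The function name `l1_chunk_loss`, the plain-list representation, and the mean reduction are assumptions made for illustration; they are not taken from the CLIP-RT training code.

```python
# Illustrative only: shapes follow the table above (8 action chunks x 7 dims).
# The function name and the mean reduction are assumptions, not the authors' code.
CHUNK_SIZE = 8   # actions predicted per forward pass
ACTION_DIM = 7   # end-effector action dimensions


def l1_chunk_loss(pred, target):
    """Mean absolute error over a CHUNK_SIZE x ACTION_DIM action chunk."""
    assert len(pred) == len(target) == CHUNK_SIZE
    total = 0.0
    for pred_step, target_step in zip(pred, target):
        assert len(pred_step) == len(target_step) == ACTION_DIM
        # Sum the absolute error across the 7 action dimensions of this step.
        total += sum(abs(p - t) for p, t in zip(pred_step, target_step))
    return total / (CHUNK_SIZE * ACTION_DIM)


# Toy example: every predicted value is off by 0.5 from an all-zero target.
pred = [[0.5] * ACTION_DIM for _ in range(CHUNK_SIZE)]
target = [[0.0] * ACTION_DIM for _ in range(CHUNK_SIZE)]
loss = l1_chunk_loss(pred, target)
```

In an actual training loop the same quantity would typically be computed on batched tensors by a deep learning framework rather than on Python lists.
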
Usage Instructions
To evaluate this model in the LIBERO simulator, refer to the clip-rt GitHub repository.
Citation
@article{kang2024cliprt,
title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
journal={arXiv preprint arXiv:2411.00508},
year={2024}
}