CLIP-RT Finetuned on LIBERO-Spatial

We finetune the original CLIP-RT model together with a 300M-parameter action decoder to enable continuous action prediction. This checkpoint is the model finetuned on the LIBERO-Spatial task suite.

Hyperparameters

Category	Details
Train	8 × H100 GPUs, each with 80 GB VRAM (batch size: 256)
Model size	1.3B parameters (CLIP-RT base + 0.3B action decoder)
Action dimension	7D end-effector actions, predicted in chunks of 8
Loss	L1 regression
Epochs	128
Performance	95.2% success rate on the LIBERO-Spatial task suite
Throughput	163 Hz
Inference	One GPU with 9 GB VRAM
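
As an illustration of the action format and loss listed above, here is a minimal sketch of an L1 regression loss over one action chunk of 8 steps × 7 dimensions. It is not the training code from the clip-rt repository: the function name `l1_chunk_loss`, the constants, and the choice to average over all 56 elements are assumptions made for this example.

```python
# Shapes taken from the table: a chunk of 8 actions, 7 dimensions each.
CHUNK_SIZE = 8
ACTION_DIM = 7


def l1_chunk_loss(predicted, target):
    """Mean absolute error over every element of one action chunk.

    Both arguments are lists of CHUNK_SIZE steps, each a list of
    ACTION_DIM floats. Mean reduction is an assumption of this sketch.
    """
    if len(predicted) != CHUNK_SIZE or len(target) != CHUNK_SIZE:
        raise ValueError(f"expected {CHUNK_SIZE} steps per chunk")

    total = 0.0
    for pred_step, target_step in zip(predicted, target):
        if len(pred_step) != ACTION_DIM or len(target_step) != ACTION_DIM:
            raise ValueError(f"expected {ACTION_DIM} values per action")
        # L1 term: absolute difference, summed over the 7 dimensions.
        total += sum(abs(p - t) for p, t in zip(pred_step, target_step))

    return total / (CHUNK_SIZE * ACTION_DIM)


# Toy data: every predicted value is off by 0.5 from the target.
target = [[0.0] * ACTION_DIM for _ in range(CHUNK_SIZE)]
predicted = [[0.5] * ACTION_DIM for _ in range(CHUNK_SIZE)]

loss = l1_chunk_loss(predicted, target)
print(loss)
```

In a batched setting the same quantity would be averaged over the batch dimension as well; the plain-Python version here only keeps the shapes easy to follow.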

Usage Instructions

To evaluate this model in the LIBERO simulator, please refer to the clip-rt GitHub repository.

Citation

@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}
