---
license: mit
datasets:
- clip-rt/modified_libero_hdf5
language:
- en
tags:
- robotics
- vla
- clip
- contrastive_learning
---
# CLIP-RT Finetuned on LIBERO-10
We finetune the original [CLIP-RT model](https://clip-rt.github.io/) with a 300M-parameter action decoder to enable continuous action prediction. This checkpoint is the model finetuned on the [LIBERO-10](https://libero-project.github.io/main.html) task suite.
## Hyperparameters
| Category | Details |
|----------------------|-----------------------------------------------------------------------|
| **Training** | 8 × H100 GPUs (80 GB VRAM each), batch size 256 |
| **Model size** | 1.3B parameters (CLIP-RT base + 0.3B action decoder) |
| **Action dimension** | 7D end-effector action × 8 action chunks |
| **Loss** | L1 regression |
| **Epochs** | 128 |
| **Performance** | 83.8% success rate on the LIBERO-10 task suite |
| **Throughput** | 163 Hz |
| **Inference** | Single GPU with 9 GB VRAM |
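For concreteness, below is a minimal sketch of the L1 regression objective over chunked actions described in the table; the tensor names and the random placeholder data are illustrative assumptions, not code from the CLIP-RT repository.

```python
import torch
import torch.nn as nn

# Shapes from the table above: each sample predicts 8 action chunks of
# 7D end-effector actions; training uses a batch size of 256.
BATCH, CHUNKS, ACTION_DIM = 256, 8, 7

# Placeholder tensors; in practice these would be the action decoder's
# predictions and the demonstration labels (random data here for illustration).
pred_actions = torch.randn(BATCH, CHUNKS, ACTION_DIM)
gt_actions = torch.randn(BATCH, CHUNKS, ACTION_DIM)

# L1 regression loss averaged over all chunks and action dimensions.
loss = nn.L1Loss()(pred_actions, gt_actions)
print(loss.item())
```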
## Usage Instructions
To evaluate this model on the LIBERO simulator, please refer to the [clip-rt GitHub repository](https://github.com/clip-rt/clip-rt/tree/main/libero).
## Citation
```bibtex
@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}
```