|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- vision-language-action |
|
- edge-deployment |
|
- tensorRT |
|
- qwen |
|
base_model: Stanford-ILIAD/minivla-vq-libero90-prismatic |
|
library_name: transformers |
|
datasets: |
|
- LIBERO |
|
pipeline_tag: image-text-to-text |
|
--- |
|
|
|
# MiniVLA |
|
|
|
This repository hosts **MiniVLA**, a modular, deployment-friendly Vision-Language-Action (VLA) model designed for **edge hardware** (e.g., Jetson Orin Nano).

It contains the MiniVLA model checkpoints, a Hugging Face-compatible Qwen-0.5B LLM with an extended tokenizer, and ONNX/TensorRT exports of the vision encoder for accelerated inference.
|
|
|
--- |
|
|
|
## Introduction
|
|
|
To enable low-latency, secure desktop robot manipulation on local devices, this project addresses the deployment and performance challenges of lightweight multimodal models on edge hardware. Using OpenVLA-Mini as a case study, we propose a hybrid acceleration pipeline that alleviates deployment bottlenecks on resource-constrained platforms.
|
|
|
We reproduced a lightweight VLA model and then significantly reduced its end-to-end latency and GPU memory usage by exporting the vision encoder to ONNX and compiling it into a TensorRT engine. While we observed a moderate drop in task success rate (around 5-10% on LIBERO desktop manipulation tasks), the results still demonstrate the feasibility of efficient, real-time VLA inference at the edge.
|
|
|
--- |
|
|
|
## System Architecture
|
|
|
The MiniVLA deployment is organized as a set of modular microservices (a minimal client-side sketch follows the component list below):
|
|
|
<p align="center"> |
|
<img src="./Results/System_Architecture.svg" width="100%" > |
|
</p> |
|
|
|
|
|
- **Inputs**: image + language instruction |
|
- **Vision Encoder**: DINOv2 / SigLIP → ONNX/TensorRT
|
- **LLM**: Qwen 2.5 0.5B (Hugging Face / TensorRT-LLM) |
|
- **Router & Fallback**: balances between local inference and accelerated microservices |
|
- **Robot Action**: decoded from predicted action tokens |
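The sketch below shows this flow from the client side: one camera frame and one instruction are posted to the main `/act` service, which returns a decoded action. The host/port and the payload field names (`image`, `instruction`) are illustrative assumptions, not a documented API.

```python
import base64

import requests

ACT_URL = "http://localhost:8800/act"  # hypothetical host/port for the main FastAPI service

# Any RGB observation from the robot camera ("frame.png" is a placeholder path).
with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {"image": image_b64, "instruction": "Close the top drawer of the cabinet."}
action = requests.post(ACT_URL, json=payload).json()
print(action)
```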
|
|
|
### Hybrid Acceleration |
|
|
|
<p align="center"> |
|
<img src="./Results/MiniVLA_Architecture.svg" width="100%" > |
|
</p> |
|
|
|
|
|
- **Vision Encoder Acceleration**: PyTorch → ONNX → TensorRT, deployed as a microservice (`/vision/encode`)
|
- **LLM Acceleration**: Hugging Face → TensorRT-LLM engine, deployed as a microservice (`/llm/generate`)
|
- **Main Process**: Orchestrates requests, handles fallback to local inference, and outputs robot actions (a minimal fallback sketch follows this list)
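The following is a minimal sketch of the fallback idea, assuming the vision microservice URL from the Usage section and a caller-supplied `local_encode` function for the non-accelerated path; it is illustrative, not the exact router code.

```python
import base64

import requests

VISION_URL = "http://vision.svc:8000/vision/encode"


def encode_image(image_bytes: bytes, local_encode, timeout: float = 2.0):
    """Prefer the TensorRT microservice; fall back to local PyTorch inference on failure."""
    payload = {"image": base64.b64encode(image_bytes).decode("utf-8")}
    try:
        resp = requests.post(VISION_URL, json=payload, timeout=timeout)
        resp.raise_for_status()
        return resp.json()  # accelerated path: embedding returned by the microservice
    except requests.RequestException:
        return local_encode(image_bytes)  # slower local path keeps the pipeline alive
```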
|
|
|
--- |
|
|
|
## Contents
|
|
|
- **`models/`** |
|
Contains the original MiniVLA model checkpoints, based on |
|
[Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic). |
|
Special thanks to the Stanford ILIAD team for their open-source contribution. |
|
|
|
- **`qwen25-0_5b-trtllm/`** |
|
Qwen-0.5B language model converted to TensorRT-LLM format. |
|
|
|
- **`qwen25-0_5b-with-extra-tokenizer/`** |
|
Hugging Face-compatible Qwen-0.5B model with extended tokenizer.
|
|
|
- **`tensorRT/`**

  Vision encoder acceleration files (see the export sketch below):

  - `vision_encoder_fp16.onnx`

  - `vision_encoder_fp16.engine`
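For reference, the sketch below shows how such files are typically produced. It assumes the vision backbone has already been extracted from the checkpoint as a standalone `torch.nn.Module`; the 224x224 input resolution and tensor names are illustrative assumptions, not the exact export script used here.

```python
import torch


def export_vision_encoder(vision_encoder: torch.nn.Module,
                          out_path: str = "vision_encoder_fp16.onnx") -> None:
    """Export the vision backbone to ONNX in fp16 (input shape is an assumption)."""
    encoder = vision_encoder.half().cuda().eval()
    dummy = torch.randn(1, 3, 224, 224, dtype=torch.float16, device="cuda")
    torch.onnx.export(
        encoder,
        dummy,
        out_path,
        input_names=["pixel_values"],
        output_names=["patch_features"],
        opset_version=17,
    )

# The TensorRT engine is then built on the target device, for example:
#   trtexec --onnx=vision_encoder_fp16.onnx --saveEngine=vision_encoder_fp16.engine --fp16
```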
|
|
|
--- |
|
|
|
|
|
## Related Project
|
|
|
For full implementation and code, please visit the companion GitHub repository:

[https://github.com/Zhenxintao/MiniVLA](https://github.com/Zhenxintao/MiniVLA)
|
|
|
|
|
## Usage
|
|
|
### Load Hugging Face Qwen-0.5B |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
# The Qwen checkpoint lives in a subfolder of this Hugging Face repository.
repo_id = "xintaozhen/MiniVLA"
subfolder = "qwen25-0_5b-with-extra-tokenizer"

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
|
``` |
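As a quick sanity check that the checkpoint and extended tokenizer load correctly (the prompt and generation settings below are arbitrary examples; in the actual pipeline the generated tokens are decoded into robot actions rather than read as text):

```python
inputs = tokenizer("Close the top drawer of the cabinet.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```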
|
|
|
### Call TensorRT Vision Encoder (HTTP API) |
|
|
|
```python |
|
import base64

import requests

url = "http://vision.svc:8000/vision/encode"

# Base64-encode a local RGB frame ("frame.png" is a placeholder path).
with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(url, json={"image": image_b64})
vision_embedding = response.json()
|
``` |
|
|
|
### Call TensorRT-LLM (HTTP API) |
|
|
|
```python

import requests
|
|
|
url = "http://llm.svc:8810/llm/generate" |
|
payload = {"prompt": "Close the top drawer of the cabinet."} |
|
response = requests.post(url, json=payload) |
|
generated_actions = response.json() |
|
``` |
|
|
|
--- |
|
|
|
## Key Contributions
|
|
|
- Built an **end-to-end online inference framework** with a FastAPI service (`/act`), transforming offline benchmark code into a **real-time deployable system** (a minimal endpoint sketch follows this list).
|
- Reproduced a lightweight **OpenVLA-Mini** and proposed a **hybrid acceleration pipeline**. |
|
- Exported the **vision encoder** to TensorRT, reducing perception latency and GPU memory usage. |
|
- Improved **GPU efficiency**: reduced average GPU utilization from ~67% to ~43% and peak utilization from ~85% to ~65%, keeping deployment feasible within an 8 GB memory budget (comparable to Jetson-class devices).
|
- Integrated **Qwen 2.5 0.5B** in Hugging Face and TensorRT-LLM formats. |
|
- Designed a **modular system architecture** with router & fallback for robustness. |
|
- Demonstrated efficient **edge-side VLA inference** under Jetson-class memory constraints on LIBERO tasks, with only a moderate success-rate drop (5-10%).
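For orientation, below is a minimal sketch of what such an `/act` endpoint can look like. The request/response field names and the `run_policy` stub are illustrative assumptions, not the exact service code.

```python
import base64

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ActRequest(BaseModel):
    image: str        # base64-encoded RGB frame
    instruction: str  # natural-language task description


def run_policy(image_bytes: bytes, instruction: str) -> list[float]:
    """Placeholder for the MiniVLA main process: vision encoding
    (TensorRT or local fallback) -> LLM generation -> action decoding."""
    raise NotImplementedError


@app.post("/act")
def act(req: ActRequest):
    image_bytes = base64.b64decode(req.image)
    action = run_policy(image_bytes, req.instruction)
    return {"action": action}
```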
|
|
|
--- |
|
|
|
## Device & Performance
|
|
|
Target deployment: **Jetson Orin Nano (16 GB / 8 GB variants)**. |
|
|
|
For simulation and reproducibility, experiments were conducted on a **local workstation** equipped with: |
|
|
|
- **GPU**: NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM) |
|
- **Driver / CUDA**: Driver 550.144.03, CUDA 12.4 |
|
- **OS**: Ubuntu 22.04 LTS

**Note**: Although the experiments were run on an RTX 4060, its 8 GB of GPU memory is comparable to entry-level Jetson devices, making it a suitable proxy for evaluating edge-deployment feasibility.
|
|
|
### GPU Memory Utilization (Long-Sequence Tasks) |
|
|
|
| Model Variant | Avg. GPU Utilization | Peak GPU Utilization | |
|
| --------------------------------------- | -------------------- | -------------------- | |
|
| Original MiniVLA (PyTorch, no TRT) | ~67% | ~85% | |
|
| MiniVLA w/ TensorRT Vision Acceleration | ~43% | ~65% | |
|
|
|
**Observation:** |
|
|
|
- The hybrid acceleration pipeline (TensorRT vision + VLA main process) reduced **average GPU utilization by ~24 percentage points** (67% → 43%) and **peak utilization by ~20 points** (85% → 65%).

- The freed-up GPU headroom allows long-sequence tasks to run stably on resource-constrained devices.
|
|
|
### Example nvidia-smi Output |
|
|
|
Original model: |
|
|
|
``` |
|
GPU Memory-Usage: 4115MiB / 8188MiB |
|
GPU-Util: 67% (peak 85%) |
|
``` |
|
|
|
With TensorRT vision acceleration: |
|
|
|
``` |
|
GPU Memory-Usage: 4055MiB / 8188MiB |
|
GPU-Util: 43% (peak 65%) |
|
``` |
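The figures above were read from `nvidia-smi`; comparable numbers can be logged programmatically during a rollout, for example with the `nvidia-ml-py` (pynvml) bindings. This is a generic monitoring sketch, not part of the MiniVLA codebase:

```python
import time

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu / .memory in percent
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # .used / .total in bytes
        print(f"GPU-Util: {util.gpu}%  "
              f"Memory-Usage: {mem.used // 2**20}MiB / {mem.total // 2**20}MiB")
        time.sleep(1.0)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```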
|
|
|
--- |
|
|
|
## License
|
|
|
This repository is released under the **Apache 2.0** license (see the `license` field in the metadata above). The bundled MiniVLA checkpoints and Qwen weights remain subject to the licenses of their respective upstream projects.
|
|
|
--- |
|
|
|
## Citation
|
|
|
If you use **MiniVLA** in your research or deployment, please cite: |
|
|
|
``` |
|
@misc{MiniVLA2025, |
|
title = {MiniVLA: A Modular Vision-Language-Action Model for Edge Deployment}, |
|
author = {Xintao Zhen}, |
|
year = {2025}, |
|
url = {https://huggingface.co/xintaozhen/MiniVLA} |
|
} |
|
``` |
|
|
|
We also acknowledge and thank the authors of [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic), which serves as the base for the checkpoints included in this repository. |
|
|
|
--- |