# MiniVLA

This repository hosts MiniVLA, a modular and deployment-friendly Vision-Language-Action (VLA) model designed for edge hardware (e.g., Jetson Orin Nano).
It contains model checkpoints, a Hugging Face-compatible Qwen-0.5B LLM, and ONNX/TensorRT exports for accelerated inference.
## Introduction
Although Vision-Language-Action (VLA) models show great potential for desktop robot tasks, their reliance on cloud computing introduces network latency, data privacy risks, and reliability challenges.
To run desktop robot tasks locally with low latency and strong data security, this project takes OpenVLA-Mini as an example and addresses the deployment and performance challenges of lightweight multimodal models on edge hardware.
We reproduce a lightweight VLA and propose a hybrid acceleration pipeline that alleviates the deployment bottleneck on resource-constrained platforms.
By exporting the vision encoder to ONNX and TensorRT engines, we significantly reduce end-to-end latency and GPU memory usage. Although a moderate drop in task success rate (around 5-10% on LIBERO desktop manipulation tasks) was observed, the results demonstrate the feasibility of efficient, real-time VLA inference at the edge.
## System Architecture
The MiniVLA deployment is designed as a set of modular microservices:

- Inputs: image + language instruction
- Vision Encoder: DINOv2 / SigLIP → ONNX/TensorRT
- LLM: Qwen 2.5 0.5B (Hugging Face / TensorRT-LLM)
- Router & Fallback: balances between local inference and accelerated microservices (see the sketch after this list)
- Robot Action: decoded from predicted action tokens
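Below is a minimal sketch of how the router could tie these stages together. The service URLs and JSON fields mirror the usage examples later in this card; the timeout values, error handling, and return format are illustrative assumptions rather than the repository's exact implementation.

```python
# Minimal sketch of the main-process routing with fallback. URLs and JSON
# fields follow the usage examples below; timeouts and error handling are
# illustrative assumptions.
import requests

VISION_URL = "http://vision.svc:8000/vision/encode"  # TensorRT vision microservice
LLM_URL = "http://llm.svc:8810/llm/generate"         # TensorRT-LLM microservice


def predict_action(image_b64: str, instruction: str):
    """Route one image + instruction through the accelerated microservices."""
    vision_embedding = None
    try:
        resp = requests.post(VISION_URL, json={"image": image_b64}, timeout=2.0)
        resp.raise_for_status()
        vision_embedding = resp.json()
    except requests.RequestException:
        # Fallback: the router would switch to the local PyTorch vision
        # encoder here instead of the accelerated microservice.
        pass

    resp = requests.post(LLM_URL, json={"prompt": instruction}, timeout=10.0)
    resp.raise_for_status()
    action_tokens = resp.json()

    # The main process decodes the predicted action tokens into a robot action.
    return vision_embedding, action_tokens
```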
### Hybrid Acceleration
- Vision Encoder Acceleration: PyTorch → ONNX → TensorRT, deployed as a microservice (`/vision/encode`); an export sketch follows this list
- LLM Acceleration: Hugging Face → TensorRT-LLM engine, deployed as a microservice (`/llm/generate`)
- Main Process: orchestrates requests, ensures fallback, and outputs robot actions
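The vision-encoder export step might look roughly like the sketch below. The stand-in module, input resolution, and opset are assumptions; the actual export settings live in the companion GitHub repository, and the TensorRT build is noted only in a comment.

```python
# Sketch of the PyTorch -> ONNX step for the vision encoder. A tiny stand-in
# module is used so the snippet runs on its own; in practice this is the
# DINOv2/SigLIP backbone from the MiniVLA checkpoint.
import torch
import torch.nn as nn

vision_encoder = nn.Conv2d(3, 8, kernel_size=3).eval()  # stand-in for the real encoder
dummy_input = torch.randn(1, 3, 224, 224)               # assumed input resolution

torch.onnx.export(
    vision_encoder,
    dummy_input,
    "vision_encoder_fp16.onnx",
    input_names=["pixel_values"],
    output_names=["vision_embedding"],
    opset_version=17,
    dynamic_axes={"pixel_values": {0: "batch"}},
)

# The ONNX graph is then compiled into a TensorRT engine, for example with:
#   trtexec --onnx=vision_encoder_fp16.onnx --fp16 --saveEngine=vision_encoder_fp16.engine
```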
## Contents

- `models/`: the original MiniVLA model checkpoints, based on Stanford-ILIAD/minivla-vq-libero90-prismatic. Special thanks to the Stanford ILIAD team for their open-source contribution.
- `qwen25-0_5b-trtllm/`: the Qwen-0.5B language model converted to TensorRT-LLM format.
- `qwen25-0_5b-with-extra-tokenizer/`: a Hugging Face-compatible Qwen-0.5B model with an extended tokenizer.
- `tensorRT/`: vision encoder acceleration files:
  - `vision_encoder_fp16.onnx`
  - `vision_encoder_fp16.engine`
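If only a subset of these folders is needed, one option (a sketch assuming the `huggingface_hub` package is installed) is to filter the download by path pattern:

```python
from huggingface_hub import snapshot_download

# Download only the TensorRT vision-encoder artifacts from this repository.
local_dir = snapshot_download(
    repo_id="xintaozhen/MiniVLA",
    allow_patterns=["tensorRT/*"],
)
print(local_dir)  # local path containing vision_encoder_fp16.onnx / .engine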
## Related Project

For the full implementation and code, please visit the companion GitHub repository:
https://github.com/Zhenxintao/MiniVLA
## Usage

### Load Hugging Face Qwen-0.5B

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load from the qwen25-0_5b-with-extra-tokenizer subfolder of this repository.
repo_id = "xintaozhen/MiniVLA"
subfolder = "qwen25-0_5b-with-extra-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
```
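Continuing from the snippet above, a quick generation check; the instruction string is taken from the LLM example below, and the decoding settings are illustrative:

```python
# Sanity check: tokenize an instruction and greedily generate a short continuation.
inputs = tokenizer("Close the top drawer of the cabinet.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```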
### Call TensorRT Vision Encoder (HTTP API)

```python
import base64
import requests

url = "http://vision.svc:8000/vision/encode"

# Base64-encode a local camera frame (placeholder path) before sending it.
with open("observation.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(url, json={"image": image_b64})
vision_embedding = response.json()
```
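Alternatively, the prebuilt engine in `tensorRT/` can be loaded in-process with the TensorRT Python runtime. This is a sketch assuming the `tensorrt` package and a local copy of `vision_encoder_fp16.engine`; engine files are specific to the TensorRT version and GPU they were built for.

```python
import tensorrt as trt

# Deserialize the prebuilt FP16 engine and create an execution context.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)
with open("tensorRT/vision_encoder_fp16.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Binding input/output buffers and running inference is omitted here; see the
# companion GitHub repository for the full microservice implementation.
```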
### Call TensorRT-LLM (HTTP API)

```python
import requests

url = "http://llm.svc:8810/llm/generate"
payload = {"prompt": "Close the top drawer of the cabinet."}
response = requests.post(url, json=payload)
generated_actions = response.json()
```
## Key Contributions
- Built an end-to-end online inference framework with a FastAPI service (`/act`), transforming offline benchmark code into a real-time deployable system (see the sketch after this list).
- Reproduced a lightweight OpenVLA-Mini and proposed a hybrid acceleration pipeline.
- Exported the vision encoder to TensorRT, reducing perception latency and GPU memory usage.
- Improved GPU memory efficiency: reduced average utilization from ~67% to ~43%, and peak usage from ~85% to ~65%, making deployment feasible under 8 GB memory constraints (similar to Jetson-class devices).
- Integrated Qwen 2.5 0.5B in Hugging Face and TensorRT-LLM formats.
- Designed a modular system architecture with router & fallback for robustness.
- Demonstrated the feasibility of efficient edge-side VLA inference for Jetson-class devices (target: Jetson Orin Nano) on LIBERO tasks, with only a moderate performance drop (5-10%).
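A minimal sketch of what the `/act` service interface could look like; the request/response field names and the placeholder action are assumptions, and the real implementation is in the companion GitHub repository.

```python
# Hypothetical shape of the /act endpoint; field names and the placeholder
# action are assumptions, not the repository's actual schema.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ActRequest(BaseModel):
    image: str        # base64-encoded RGB observation
    instruction: str  # natural-language task instruction


@app.post("/act")
def act(req: ActRequest):
    # In the real system this orchestrates /vision/encode and /llm/generate
    # (with fallback) and decodes the predicted action tokens into a robot action.
    return {"action": [0.0] * 7}  # placeholder action vector
```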
## Device & Performance
Target deployment: Jetson Orin Nano (16 GB / 8 GB variants).
For simulation and reproducibility, experiments were conducted on a local workstation equipped with:
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM)
- Driver / CUDA: Driver 550.144.03, CUDA 12.4
- OS: Ubuntu 22.04 LTS
Note: Although the experiments were run on an RTX 4060, its GPU memory (8 GB) is comparable to entry-level Jetson devices, making it a suitable proxy for evaluating edge deployment feasibility.
### GPU Utilization (Long-Sequence Tasks)
| Model Variant | Avg. GPU Utilization | Peak GPU Utilization |
|---|---|---|
| Original MiniVLA (PyTorch, no TRT) | ~67% | ~85% |
| MiniVLA w/ TensorRT Vision Acceleration | ~43% | ~65% |
Observations:

- The hybrid acceleration pipeline (TensorRT vision + VLA main process) reduced average GPU utilization by ~24 percentage points and peak utilization by ~20 percentage points.
- This indicates better GPU efficiency, allowing longer-sequence tasks to run stably on resource-constrained devices.
### Example `nvidia-smi` Output

Original model:

```text
GPU Memory-Usage: 4115MiB / 8188MiB
GPU-Util: 67% (peak 85%)
```

With TensorRT vision acceleration:

```text
GPU Memory-Usage: 4055MiB / 8188MiB
GPU-Util: 43% (peak 65%)
```
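The same counters can be sampled programmatically; below is a sketch assuming the `pynvml` package (the sampling interval and duration are arbitrary).

```python
import time
import pynvml

# Sample GPU utilization and memory usage once per second (the same counters
# that nvidia-smi reports).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU-Util: {util.gpu}%  Memory: {mem.used / 2**20:.0f}MiB / {mem.total / 2**20:.0f}MiB")
    time.sleep(1.0)
pynvml.nvmlShutdown()
```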
## License
Specify the license here (e.g., Apache 2.0, MIT, or same as MiniVLA / Qwen license).
## Citation
If you use MiniVLA in your research or deployment, please cite:
```bibtex
@misc{MiniVLA2025,
  title  = {MiniVLA: A Modular Vision-Language-Action Model for Edge Deployment},
  author = {Xintao Zhen},
  year   = {2025},
  url    = {https://huggingface.co/xintaozhen/MiniVLA}
}
```
We also acknowledge and thank the authors of Stanford-ILIAD/minivla-vq-libero90-prismatic, which serves as the base for the checkpoints included in this repository.