|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- vision-language-action |
|
- edge-deployment |
|
- tensorRT |
|
- qwen |
|
base_model: Stanford-ILIAD/minivla-vq-libero90-prismatic |
|
library_name: transformers |
|
datasets: |
|
- LIBERO |
|
pipeline_tag: image-text-to-text |
|
--- |
|
|
|
# MiniVLA |
|
|
|
This repository hosts **MiniVLA**, a modular, deployment-friendly Vision-Language-Action (VLA) model designed for **edge hardware** (e.g., Jetson Orin Nano).

It contains the MiniVLA model checkpoints, a Hugging Face-compatible Qwen-0.5B LLM with an extended tokenizer, and ONNX/TensorRT exports of the vision encoder for accelerated inference.
|
|
|
--- |
|
|
|
## Introduction
|
|
|
To enable low-latency, secure desktop robot manipulation on local devices, this project addresses the deployment and performance challenges of lightweight multimodal models on edge hardware. Using OpenVLA-Mini as a case study, we propose a hybrid acceleration pipeline that alleviates deployment bottlenecks on resource-constrained platforms.
|
|
|
We reproduced a lightweight VLA model and then significantly reduced its end-to-end latency and GPU memory usage by exporting the vision encoder to ONNX and compiling it into a TensorRT engine. While we observed a moderate drop in task success rate (around 5-10% on LIBERO desktop manipulation tasks), the results still demonstrate the feasibility of efficient, real-time VLA inference at the edge.
|
|
|
--- |
|
|
|
## System Architecture
|
|
|
The MiniVLA deployment is organized as a set of modular microservices (a minimal client-side sketch follows the component list below):
|
|
|
<p align="center"> |
|
<img src="./Results/System_Architecture.svg" width="100%" > |
|
</p> |
|
|
|
|
|
- **Inputs**: image + language instruction |
|
- **Vision Encoder**: DINOv2 / SigLIP → ONNX/TensorRT
|
- **LLM**: Qwen 2.5 0.5B (Hugging Face / TensorRT-LLM) |
|
- **Router & Fallback**: balances between local inference and accelerated microservices |
|
- **Robot Action**: decoded from predicted action tokens |
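The sketch below shows this flow from the client side: one camera frame and one instruction are posted to the main `/act` service, which returns a decoded action. The host/port and the payload field names (`image`, `instruction`) are illustrative assumptions, not a documented API.

```python
import base64

import requests

ACT_URL = "http://localhost:8800/act"  # hypothetical host/port for the main FastAPI service

# Any RGB observation from the robot camera ("frame.png" is a placeholder path).
with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {"image": image_b64, "instruction": "Close the top drawer of the cabinet."}
action = requests.post(ACT_URL, json=payload).json()
print(action)
```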
|
|
|
### Hybrid Acceleration |
|
|
|
<p align="center"> |
|
<img src="./Results/MiniVLA_Architecture.svg" width="100%" > |
|
</p> |
|
|
|
|
|
- **Vision Encoder Acceleration**: PyTorch → ONNX → TensorRT, deployed as a microservice (`/vision/encode`)
|
- **LLM Acceleration**: Hugging Face → TensorRT-LLM engine, deployed as a microservice (`/llm/generate`)
|
- **Main Process**: Orchestrates requests, handles fallback to local inference, and outputs robot actions (a minimal fallback sketch follows this list)
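The following is a minimal sketch of the fallback idea, assuming the vision microservice URL from the Usage section and a caller-supplied `local_encode` function for the non-accelerated path; it is illustrative, not the exact router code.

```python
import base64

import requests

VISION_URL = "http://vision.svc:8000/vision/encode"


def encode_image(image_bytes: bytes, local_encode, timeout: float = 2.0):
    """Prefer the TensorRT microservice; fall back to local PyTorch inference on failure."""
    payload = {"image": base64.b64encode(image_bytes).decode("utf-8")}
    try:
        resp = requests.post(VISION_URL, json=payload, timeout=timeout)
        resp.raise_for_status()
        return resp.json()  # accelerated path: embedding returned by the microservice
    except requests.RequestException:
        return local_encode(image_bytes)  # slower local path keeps the pipeline alive
```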
|
|
|
--- |
|
|
|
## Contents
|
|
|
- **`models/`** |
|
Contains the original MiniVLA model checkpoints, based on |
|
[Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic). |
|
Special thanks to the Stanford ILIAD team for their open-source contribution. |
|
|
|
- **`qwen25-0_5b-trtllm/`** |
|
Qwen-0.5B language model converted to TensorRT-LLM format. |
|
|
|
- **`qwen25-0_5b-with-extra-tokenizer/`** |
|
Hugging Face-compatible Qwen-0.5B model with extended tokenizer.
|
|
|
- **`tensorRT/`**

  Vision encoder acceleration files (see the export sketch below):

  - `vision_encoder_fp16.onnx`

  - `vision_encoder_fp16.engine`
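For reference, the sketch below shows how such files are typically produced. It assumes the vision backbone has already been extracted from the checkpoint as a standalone `torch.nn.Module`; the 224x224 input resolution and tensor names are illustrative assumptions, not the exact export script used here.

```python
import torch


def export_vision_encoder(vision_encoder: torch.nn.Module,
                          out_path: str = "vision_encoder_fp16.onnx") -> None:
    """Export the vision backbone to ONNX in fp16 (input shape is an assumption)."""
    encoder = vision_encoder.half().cuda().eval()
    dummy = torch.randn(1, 3, 224, 224, dtype=torch.float16, device="cuda")
    torch.onnx.export(
        encoder,
        dummy,
        out_path,
        input_names=["pixel_values"],
        output_names=["patch_features"],
        opset_version=17,
    )

# The TensorRT engine is then built on the target device, for example:
#   trtexec --onnx=vision_encoder_fp16.onnx --saveEngine=vision_encoder_fp16.engine --fp16
```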
|
|
|
--- |
|
|
|
|
|
## Related Project
|
|
|
For full implementation and code, please visit the companion GitHub repository:

[https://github.com/Zhenxintao/MiniVLA](https://github.com/Zhenxintao/MiniVLA)
|
|
|
|
|
## Usage
|
|
|
### Load Hugging Face Qwen-0.5B |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
# The Qwen checkpoint lives in a subfolder of this Hugging Face repository.
repo_id = "xintaozhen/MiniVLA"
subfolder = "qwen25-0_5b-with-extra-tokenizer"

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
|
``` |
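As a quick sanity check that the checkpoint and extended tokenizer load correctly (the prompt and generation settings below are arbitrary examples; in the actual pipeline the generated tokens are decoded into robot actions rather than read as text):

```python
inputs = tokenizer("Close the top drawer of the cabinet.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```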
|
|
|
### Call TensorRT Vision Encoder (HTTP API) |
|
|
|
```python |
|
import base64

import requests

url = "http://vision.svc:8000/vision/encode"

# Base64-encode a local RGB frame ("frame.png" is a placeholder path).
with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(url, json={"image": image_b64})
vision_embedding = response.json()
|
``` |
|
|
|
### Call TensorRT-LLM (HTTP API) |
|
|
|
```python

import requests
|
|
|
url = "http://llm.svc:8810/llm/generate" |
|
payload = {"prompt": "Close the top drawer of the cabinet."} |
|
response = requests.post(url, json=payload) |
|
generated_actions = response.json() |
|
``` |
|
|
|
--- |
|
|
|
## Key Contributions
|
|
|
- Built an **end-to-end online inference framework** with a FastAPI service (`/act`), transforming offline benchmark code into a **real-time deployable system** (a minimal endpoint sketch follows this list).
|
- Reproduced a lightweight **OpenVLA-Mini** and proposed a **hybrid acceleration pipeline**. |
|
- Exported the **vision encoder** to TensorRT, reducing perception latency and GPU memory usage. |
|
- Improved **GPU efficiency**: reduced average GPU utilization from ~67% to ~43% and peak utilization from ~85% to ~65%, keeping deployment feasible within an 8 GB memory budget (comparable to Jetson-class devices).
|
- Integrated **Qwen 2.5 0.5B** in Hugging Face and TensorRT-LLM formats. |
|
- Designed a **modular system architecture** with router & fallback for robustness. |
|
- Demonstrated efficient **edge-side VLA inference** under Jetson-class memory constraints on LIBERO tasks, with only a moderate success-rate drop (5-10%).
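For orientation, below is a minimal sketch of what such an `/act` endpoint can look like. The request/response field names and the `run_policy` stub are illustrative assumptions, not the exact service code.

```python
import base64

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ActRequest(BaseModel):
    image: str        # base64-encoded RGB frame
    instruction: str  # natural-language task description


def run_policy(image_bytes: bytes, instruction: str) -> list[float]:
    """Placeholder for the MiniVLA main process: vision encoding
    (TensorRT or local fallback) -> LLM generation -> action decoding."""
    raise NotImplementedError


@app.post("/act")
def act(req: ActRequest):
    image_bytes = base64.b64decode(req.image)
    action = run_policy(image_bytes, req.instruction)
    return {"action": action}
```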
|
|
|
--- |
|
|
|
## Device & Performance
|
|
|
Target deployment: **Jetson Orin Nano (16 GB / 8 GB variants)**. |
|
|
|
For simulation and reproducibility, experiments were conducted on a **local workstation** equipped with: |
|
|
|
- **GPU**: NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM) |
|
- **Driver / CUDA**: Driver 550.144.03, CUDA 12.4 |
|
- **OS**: Ubuntu 22.04 LTS

**Note**: Although the experiments were run on an RTX 4060, its 8 GB of GPU memory is comparable to entry-level Jetson devices, making it a suitable proxy for evaluating edge-deployment feasibility.
|
|
|
### GPU Memory Utilization (Long-Sequence Tasks) |
|
|
|
| Model Variant | Avg. GPU Utilization | Peak GPU Utilization | |
|
| --------------------------------------- | -------------------- | -------------------- | |
|
| Original MiniVLA (PyTorch, no TRT) | ~67% | ~85% | |
|
| MiniVLA w/ TensorRT Vision Acceleration | ~43% | ~65% | |
|
|
|
**Observation:** |
|
|
|
- The hybrid acceleration pipeline (TensorRT vision + VLA main process) reduced **average GPU utilization by ~24 percentage points** (67% → 43%) and **peak utilization by ~20 points** (85% → 65%).

- The freed-up GPU headroom allows long-sequence tasks to run stably on resource-constrained devices.
|
|
|
### Example nvidia-smi Output |
|
|
|
Original model: |
|
|
|
``` |
|
GPU Memory-Usage: 4115MiB / 8188MiB |
|
GPU-Util: 67% (peak 85%) |
|
``` |
|
|
|
With TensorRT vision acceleration: |
|
|
|
``` |
|
GPU Memory-Usage: 4055MiB / 8188MiB |
|
GPU-Util: 43% (peak 65%) |
|
``` |
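The figures above were read from `nvidia-smi`; comparable numbers can be logged programmatically during a rollout, for example with the `nvidia-ml-py` (pynvml) bindings. This is a generic monitoring sketch, not part of the MiniVLA codebase:

```python
import time

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu / .memory in percent
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # .used / .total in bytes
        print(f"GPU-Util: {util.gpu}%  "
              f"Memory-Usage: {mem.used // 2**20}MiB / {mem.total // 2**20}MiB")
        time.sleep(1.0)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```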
|
|
|
--- |
|
|
|
## License
|
|
|
This repository is released under the **Apache 2.0** license (see the `license` field in the metadata above). The bundled MiniVLA checkpoints and Qwen weights remain subject to the licenses of their respective upstream projects.
|
|
|
--- |
|
|
|
## Citation
|
|
|
If you use **MiniVLA** in your research or deployment, please cite: |
|
|
|
``` |
|
@misc{MiniVLA2025, |
|
title = {MiniVLA: A Modular Vision-Language-Action Model for Edge Deployment}, |
|
author = {Xintao Zhen}, |
|
year = {2025}, |
|
url = {https://huggingface.co/xintaozhen/MiniVLA} |
|
} |
|
``` |
|
|
|
We also acknowledge and thank the authors of [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic), which serves as the base for the checkpoints included in this repository. |
|
|
|
--- |