|
--- |
|
library_name: transformers |
|
license: cc-by-nc-3.0 |
|
--- |
|
|
|
# Model Card for ReVision-250M-256-16-baseline |
|
|
|
This repository contains **ReVision-250M-256-16-baseline**, a compact **vision-language model (VLM)** designed for **Visual Instruction Rewriting**. The model rewrites **multimodal task-oriented instructions** into text-only commands, enabling privacy-preserving on-device AI by eliminating the need to process images in the cloud. |
|
|
|
## Key Features |
|
- **Lightweight (250M parameters)**: Designed for on-device deployment with efficient inference. |
|
- **Privacy-Preserving**: Converts multimodal inputs into structured text, reducing reliance on cloud-based processing. |
|
- **Fine-Tuned for Instruction Rewriting**: Fine-tuned on 39,023 examples spanning 14 task-oriented domains. |
|
- **Compact Yet Effective**: Outperforms larger models like PaliGemma-v2 (10B) and QwenVL-7B in instruction rewriting tasks. |
|
|
|
## Model Architecture |
|
- **Vision Encoder**: `google/siglip-base-patch16-256` (processes 256×256 images). |
|
- **Language Model**: `OuteAI/Lite-Mistral-150M-v2-Instruct` (instruction-tuned). |
|
- **Multimodal Fusion**: A linear projector aligns the vision encoder's patch embeddings with the language model's embedding space (see the sketch after this list). |
|
- **Training Data**: Pretrained on image-captioning data (e.g., LLaVA-CC3M, LLaVA-Pretrain), then fine-tuned on the Visual Instruction Rewriting dataset. |
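
The projector ships inside the checkpoint, so it does not need to be built by hand; the snippet below is only a minimal sketch of the fusion idea. The class name `LinearProjector` and the embedding sizes (768 on both sides) are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch: map vision-encoder patch embeddings into the language
    model's embedding space so they can be consumed as "visual tokens"."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 768):
        # Dimensions are illustrative assumptions, not the model's actual config.
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, vision_dim) from SigLIP
        # returns:          (batch, num_patches, text_dim)
        return self.proj(patch_embeddings)

# A 256x256 image with 16x16 patches yields 16x16 = 256 patch embeddings.
projector = LinearProjector()
visual_tokens = projector(torch.randn(1, 256, 768))
print(visual_tokens.shape)  # torch.Size([1, 256, 768])
```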
|
|
|
## Performance |
|
| Model | ROUGE-1 | BLEU | Intent Accuracy | Argument Similarity |
|-------|---------|------|-----------------|---------------------|
| ReVision-250M-256-16-baseline | 56.9% | 27.7% | 56.5% | 68.8% |
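
Intent accuracy and argument similarity are custom metrics from the accompanying evaluation; the sketch below only shows one plausible way to compute ROUGE-1 and BLEU on predicted rewrites with the Hugging Face `evaluate` library, and is not the authors' exact scoring script.

```python
# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["Call 512-555-1234."]   # model rewrites
references = ["Call 512-555-1234."]    # gold rewrites

rouge_scores = rouge.compute(predictions=predictions, references=references)
# BLEU takes one or more reference strings per prediction.
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print("ROUGE-1:", rouge_scores["rouge1"])
print("BLEU:", bleu_scores["bleu"])
```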
|
|
|
|
|
## How to Use |
|
|
|
### Install Dependencies |
|
```bash
pip install torch transformers torchvision
```
|
|
|
### Load the Model |
|
```python
from transformers import AutoProcessor, AutoModelForSeq2SeqLM
import torch
from PIL import Image

# Load model and processor
model_name = "hsiangfu/ReVision-250M-256-16-baseline"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prepare inputs (image + instruction)
image = Image.open("example.jpg")
instruction = "Call this number."

inputs = processor(images=image, text=instruction, return_tensors="pt")
outputs = model.generate(**inputs)

# Decode the rewritten instruction
rewritten_instruction = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("Rewritten Instruction:", rewritten_instruction)
```
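
If the rewrites come back truncated or overly verbose, the decoding settings can be tightened; the values below are illustrative defaults rather than recommendations from the model authors, and the snippet reuses `model`, `inputs`, and `processor` from above.

```python
# Reuses `model`, `inputs`, and `processor` from the previous snippet.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # rewrites are short, single-sentence commands
    num_beams=4,         # beam search often improves faithfulness
    early_stopping=True,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```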
|
|
|
## Dataset |
|
|
|
The model was fine-tuned on the ReVision Multimodal Query Rewrites Dataset, a collection of 39,023 ⟨image, original instruction, rewritten instruction⟩ triplets covering: |
|
|
|
- Books: "Who wrote this book?" → "Who wrote 'The Silent Patient'?" |
|
- Business Cards: "Call this number." → "Call 512-555-1234." |
|
- Flyers & Signboards: "Add this event to my calendar." → "Add 'Tech Conference' on May 5 at 2 PM to my calendar." |
|
- Landmarks: "Who made this?" → "Who made the Statue of Liberty?" |
|
- Products: "What brand is this product?" → "What brand made 'Mismatched Sandwich Cremes'?" |
|
- CD covers: "Who made this CD?" → "Who made 'Future'?" |
|
- Paintings: "Who is this painting by?" → "Who made the painting 'Mona Lisa'?" |
|
|
|
Link: https://huggingface.co/datasets/hsiangfu/multimodal_query_rewrites |
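
The dataset can be pulled directly with the `datasets` library; the split name below is an assumption, so check the dataset page for the actual splits and column schema.

```python
from datasets import load_dataset

# "train" is an assumed split name; inspect the dataset page for the real schema.
ds = load_dataset("hsiangfu/multimodal_query_rewrites", split="train")

print(ds)     # prints the actual column names and number of rows
print(ds[0])  # one <image, original instruction, rewritten instruction> triplet
```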
|
|
|
|
|
## Applications |
|
|
|
- AR/VR Assistants (e.g., Apple Vision Pro, Meta Ray-Ban Glasses) |
|
- Smartphones & Wearables (on-device AI assistants) |
|
- Accessibility & Assistive AI (for users with visual impairments) |
|
|
|
## Citation |
|
|
|
|
|
|
|
## Acknowledgments |
|
|
|
Developed by researchers at UT Austin and Yale University. The model and dataset are available for academic, non-commercial use (CC BY-NC 3.0). |
|
|