Model Card for ReVision-250M-256-16-baseline

This repository contains ReVision-250M-256-16-baseline, a compact vision-language model (VLM) designed for Visual Instruction Rewriting. The model rewrites multimodal task-oriented instructions into text-only commands, enabling privacy-preserving on-device AI by eliminating the need to process images in the cloud.

Key Features

  • Lightweight (250M parameters): Designed for on-device deployment with efficient inference.
  • Privacy-Preserving: Converts multimodal inputs into structured text, reducing reliance on cloud-based processing.
  • Fine-Tuned for Instruction Rewriting: Trained on a dataset of 39,000 examples spanning 14 task-oriented domains.
  • Compact Yet Effective: Outperforms larger models like PaliGemma-v2 (10B) and QwenVL-7B in instruction rewriting tasks.

Model Architecture

  • Vision Encoder: google/siglip-base-patch16-256 (processes 256×256 images).
  • Language Model: OuteAI/Lite-Mistral-150M-v2-Instruct (instruction-tuned).
  • Multimodal Fusion: A linear projector aligns the vision and language embeddings (see the sketch after this list).
  • Training Data: Pretrained on image-captioning datasets (e.g., LLaVA-CC3M, LLaVA-Pretrain) and fine-tuned on the Visual Instruction Rewriting dataset.
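
The fusion step amounts to a single linear layer that maps SigLIP patch embeddings into the language model's token-embedding space; the projected visual tokens are then concatenated with the text embeddings before decoding. The PyTorch sketch below illustrates the idea only; the embedding dimensions, module name, and the prepend-then-decode layout are assumptions, not the repository's actual implementation.

import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    # Illustrative sketch: project SigLIP patch embeddings (assumed dim 768)
    # into the language model's embedding space (assumed dim 768).
    def __init__(self, vision_dim=768, text_dim=768):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, patch_embeddings, text_embeddings):
        # patch_embeddings: (batch, num_patches, vision_dim)
        # text_embeddings:  (batch, num_text_tokens, text_dim)
        visual_tokens = self.proj(patch_embeddings)
        # Prepend projected visual tokens so the decoder can attend to the
        # image while rewriting the instruction.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Illustrative shapes: a 256x256 image with 16x16 patches gives 256 patches.
patches = torch.randn(1, 256, 768)   # vision-encoder patch embeddings (assumed dim)
text = torch.randn(1, 12, 768)       # language-model token embeddings (assumed dim)
fused = VisionToTextProjector()(patches, text)
print(fused.shape)                   # torch.Size([1, 268, 768])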

Performance

Model                         | ROUGE-1 | BLEU  | Intent Accuracy | Argument Similarity
ReVision-250M-256-16-baseline | 56.9%   | 27.7% | 56.5%           | 68.8%
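
ROUGE-1 and BLEU compare each generated rewrite against the reference rewrite. The sketch below shows how such scores can be computed with the Hugging Face evaluate library (requires the evaluate and rouge_score packages); the prediction/reference strings are placeholders, and intent accuracy and argument similarity additionally require the dataset's intent and argument annotations, which are not shown here.

import evaluate

# Placeholder outputs; in practice these come from running the model over
# the test split of the rewrites dataset.
predictions = ["Call 512-555-1234."]
references = ["Call 512-555-1234, please."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(rouge.compute(predictions=predictions, references=references))            # includes rouge1
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))  # corpus BLEU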

How to Use

Install Dependencies

pip install torch torchvision transformers pillow

Load the Model

from transformers import AutoProcessor, AutoModelForSeq2SeqLM
import torch
from PIL import Image

# Load model and processor
model_name = "hsiangfu/ReVision-250M-256-16-baseline"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prepare inputs (image + instruction)
image = Image.open("example.jpg").convert("RGB")  # ensure a 3-channel RGB image
instruction = "Call this number."

inputs = processor(images=image, text=instruction, return_tensors="pt")
outputs = model.generate(**inputs)

# Decode rewritten instruction
rewritten_instruction = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("Rewritten Instruction:", rewritten_instruction)

Dataset

The model was fine-tuned on the ReVision Multimodal Query Rewrites Dataset, a collection of 39,023 ⟨image, original instruction, rewritten instruction⟩ triplets covering:

  • Books: "Who wrote this book?" → "Who wrote 'The Silent Patient'?"
  • Business Cards: "Call this number." → "Call 512-555-1234."
  • Flyers & Signboards: "Add this event to my calendar." → "Add 'Tech Conference' on May 5 at 2 PM to my calendar."
  • Landmarks: "Who made this?" → "Who made the Statue of Liberty?"
  • Products: "What brand is this product?" → "What brand made 'Mismatched Sandwich Cremes'?"
  • CD Covers: "Who made this CD?" → "Who made 'Future'?"
  • Paintings: "Who is this painting by?" → "Who made the painting 'Mona Lisa'?"

Link: https://huggingface.co/datasets/hsiangfu/multimodal_query_rewrites
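
If the dataset follows the standard Hugging Face datasets layout, it can be loaded directly with the datasets library. The split and column names below are assumptions about the repository layout; inspect the printed features to see the actual fields.

from datasets import load_dataset

# Assumed: a default configuration with a "train" split.
ds = load_dataset("hsiangfu/multimodal_query_rewrites")
print(ds)               # available splits and feature names
print(ds["train"][0])   # one <image, original instruction, rewritten instruction> example (assumed split name)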

Applications

  • AR/VR Assistants (e.g., Apple Vision Pro, Meta Ray-Ban Glasses)
  • Smartphones & Wearables (on-device AI assistants)
  • Accessibility & Assistive AI (for users with visual impairments)

Citation

Acknowledgments

Developed by researchers at UT Austin and Yale University. The model and dataset are available for academic use.
