|
--- |
|
library_name: transformers |
|
license: cc-by-nc-3.0 |
|
--- |
|
|
|
# Model Card for ReVision-250M-256-16-baseline |
|
|
|
This repository contains **ReVision-250M-256-16-baseline**, a compact **vision-language model (VLM)** designed for **Visual Instruction Rewriting**. The model rewrites **multimodal task-oriented instructions** into text-only commands, enabling privacy-preserving on-device AI by eliminating the need to process images in the cloud. |
|
|
|
## Key Features |
|
- **Lightweight (250M parameters)**: Designed for on-device deployment with efficient inference. |
|
- **Privacy-Preserving**: Converts multimodal inputs into structured text, reducing reliance on cloud-based processing. |
|
- **Fine-Tuned for Instruction Rewriting**: Fine-tuned on 39,023 examples spanning 14 task-oriented domains. |
|
- **Compact Yet Effective**: Outperforms larger models like PaliGemma-v2 (10B) and QwenVL-7B in instruction rewriting tasks. |
|
|
|
## Model Architecture |
|
- **Vision Encoder**: `google/siglip-base-patch16-256` (processes 256×256 images). |
|
- **Language Model**: `OuteAI/Lite-Mistral-150M-v2-Instruct` (instruction-tuned). |
|
- **Multimodal Fusion**: A linear projector aligns the vision encoder's patch embeddings with the language model's embedding space (see the sketch after this list). |
|
- **Training Data**: Pretrained on image-captioning data (e.g., LLaVA-CC3M, LLaVA-Pretrain), then fine-tuned on the Visual Instruction Rewriting dataset. |
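
The projector ships inside the checkpoint, so it does not need to be built by hand; the snippet below is only a minimal sketch of the fusion idea. The class name `LinearProjector` and the embedding sizes (768 on both sides) are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch: map vision-encoder patch embeddings into the language
    model's embedding space so they can be consumed as "visual tokens"."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 768):
        # Dimensions are illustrative assumptions, not the model's actual config.
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, vision_dim) from SigLIP
        # returns:          (batch, num_patches, text_dim)
        return self.proj(patch_embeddings)

# A 256x256 image with 16x16 patches yields 16x16 = 256 patch embeddings.
projector = LinearProjector()
visual_tokens = projector(torch.randn(1, 256, 768))
print(visual_tokens.shape)  # torch.Size([1, 256, 768])
```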
|
|
|
## Performance |
|
| Model | ROUGE-1 | BLEU | Intent Accuracy | Argument Similarity |
|-------|---------|------|-----------------|---------------------|
| ReVision-250M-256-16-baseline | 56.9% | 27.7% | 56.5% | 68.8% |
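
Intent accuracy and argument similarity are custom metrics from the accompanying evaluation; the sketch below only shows one plausible way to compute ROUGE-1 and BLEU on predicted rewrites with the Hugging Face `evaluate` library, and is not the authors' exact scoring script.

```python
# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["Call 512-555-1234."]   # model rewrites
references = ["Call 512-555-1234."]    # gold rewrites

rouge_scores = rouge.compute(predictions=predictions, references=references)
# BLEU takes one or more reference strings per prediction.
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print("ROUGE-1:", rouge_scores["rouge1"])
print("BLEU:", bleu_scores["bleu"])
```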
|
|
|
|
|
## How to Use |
|
|
|
### Install Dependencies |
|
```bash
pip install torch transformers torchvision
```
|
|
|
### Load the Model |
|
```python
from transformers import AutoProcessor, AutoModelForSeq2SeqLM
import torch
from PIL import Image

# Load model and processor
model_name = "hsiangfu/ReVision-250M-256-16-baseline"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prepare inputs (image + instruction)
image = Image.open("example.jpg")
instruction = "Call this number."

inputs = processor(images=image, text=instruction, return_tensors="pt")
outputs = model.generate(**inputs)

# Decode the rewritten instruction
rewritten_instruction = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("Rewritten Instruction:", rewritten_instruction)
```
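
If the rewrites come back truncated or overly verbose, the decoding settings can be tightened; the values below are illustrative defaults rather than recommendations from the model authors, and the snippet reuses `model`, `inputs`, and `processor` from above.

```python
# Reuses `model`, `inputs`, and `processor` from the previous snippet.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # rewrites are short, single-sentence commands
    num_beams=4,         # beam search often improves faithfulness
    early_stopping=True,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```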
|
|
|
## Dataset |
|
|
|
The model was fine-tuned on the ReVision Multimodal Query Rewrites Dataset, a collection of 39,023 ⟨image, original instruction, rewritten instruction⟩ triplets covering: |
|
|
|
- Books: "Who wrote this book?" → "Who wrote 'The Silent Patient'?" |
|
- Business Cards: "Call this number." → "Call 512-555-1234." |
|
- Flyers & Signboards: "Add this event to my calendar." → "Add 'Tech Conference' on May 5 at 2 PM to my calendar." |
|
- Landmarks: "Who made this?" → "Who made the Statue of Liberty?" |
|
- Products: "What brand is this product?" → "What brand made 'Mismatched Sandwich Cremes'?" |
|
- CD covers: "Who made this CD?" → "Who made 'Future'?" |
|
- Paintings: "Who is this painting by?" → "Who made the painting 'Mona Lisa'?" |
|
|
|
Link: https://huggingface.co/datasets/hsiangfu/multimodal_query_rewrites |
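
The dataset can be pulled directly with the `datasets` library; the split name below is an assumption, so check the dataset page for the actual splits and column schema.

```python
from datasets import load_dataset

# "train" is an assumed split name; inspect the dataset page for the real schema.
ds = load_dataset("hsiangfu/multimodal_query_rewrites", split="train")

print(ds)     # prints the actual column names and number of rows
print(ds[0])  # one <image, original instruction, rewritten instruction> triplet
```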
|
|
|
|
|
## Applications |
|
|
|
- AR/VR Assistants (e.g., Apple Vision Pro, Meta Ray-Ban Glasses) |
|
- Smartphones & Wearables (on-device AI assistants) |
|
- Accessibility & Assistive AI (for users with visual impairments) |
|
|
|
## Citation |
|
|
|
|
|
|
|
## Acknowledgments |
|
|
|
Developed by researchers at UT Austin and Yale University. The model and dataset are available for academic, non-commercial use (CC BY-NC 3.0). |
|
|