# GUIrilla-See-0.7B

*Lightweight vision–language model for GUI element localisation*

## Summary
GUIrilla-See-0.7B is a 0.7-billion-parameter model derived from Florence-2-large and fine-tuned for open-vocabulary detection in graphical user-interface (GUI) screenshots. Given an image and a free-form textual description, the model returns either

- the bounding box of the best-matching element, or
- a polygon mask when a bounding box is unavailable.

The model is intended for research on lightweight GUI agents, automated testing, and accessibility tools where a small footprint is preferred over its larger counterpart.
## Quick-start
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# --- load pipeline -----------------------------------------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "GUIrilla/GUIrilla-See-0.7B"  # 0.7B parameters
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# --- inference ---------------------------------------------------------------
image = Image.open("screenshot.png").convert("RGB")
task_prompt = "<OPEN_VOCABULARY_DETECTION>"
text_query = 'button with the label "Submit"'
prompt = task_prompt + text_query

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, dtype)

with torch.no_grad():
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
        early_stopping=False,
    )

decoded = processor.batch_decode(ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    decoded, task=task_prompt, image_size=image.size
)["<OPEN_VOCABULARY_DETECTION>"]
```
## Training Data
The model was trained on the GUIrilla-Task dataset.

- Training split: 25,606 tasks across 881 macOS applications (10% of applications held out for validation)
- Test split: 1,565 tasks across 227 macOS applications
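If the dataset is published on the Hugging Face Hub, it can be pulled with the `datasets` library. The Hub ID below is an assumption that mirrors the org/model naming, and the record fields depend on the released schema:

```python
from datasets import load_dataset

# Hub ID assumed to mirror the org naming; adjust to the actual dataset ID.
ds = load_dataset("GUIrilla/GUIrilla-Task")

print(ds)              # expected splits: train / test
print(ds["train"][0])  # one task: screenshot, element description, target geometry
```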
## Training Procedure
- 4 epochs of LoRA fine-tuning on a single A100 40 GB GPU.
- Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), LR = 5e-6, warm-up ratio 0.01.
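A minimal sketch of a matching LoRA setup with `peft` and `transformers`. Only the epochs, optimiser betas, learning rate, and warm-up ratio come from the description above; the adapter rank, alpha, target modules, and batch size are illustrative placeholders:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Adapter shape is a placeholder; match target_modules to the checkpoint's
# actual attention projection names before training.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "k_proj", "v_proj"])
model = get_peft_model(base, lora)

# Reported hyperparameters: 4 epochs, AdamW(β₁=0.9, β₂=0.95),
# LR 5e-6, warm-up ratio 0.01, single A100 40 GB.
args = TrainingArguments(
    output_dir="guirilla-see-0.7b-lora",
    num_train_epochs=4,
    learning_rate=5e-6,
    warmup_ratio=0.01,
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,
    per_device_train_batch_size=1,  # placeholder; batch size was not reported
)
```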
## Evaluation
| Split | Success Rate (%) |
|-------|------------------|
| Test  | 53.55            |
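Success rate is the percentage of test tasks the model grounds correctly. The exact criterion is not spelled out in this card; a common GUI-grounding convention, assumed here for illustration, counts a prediction as successful when the predicted click point lands inside the ground-truth element box:

```python
def is_success(pred_point, gt_box):
    # Assumed criterion: predicted click point inside the ground-truth box.
    x, y = pred_point
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

def success_rate(preds, gt_boxes):
    hits = sum(is_success(p, b) for p, b in zip(preds, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```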
## Ethical & Safety Notes
- Always sandbox the model or require confirmation steps before connecting it to real GUIs.
- Screenshots may reveal sensitive data; ensure compliance with applicable privacy regulations.
## License

MIT (see `LICENSE`).
## Model tree

Base model: [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)