# GUIrilla-See-0.7B

*Lightweight vision–language model for GUI element localisation*

## Summary
GUIrilla-See-0.7B is a 0.7-billion-parameter model derived from Florence-2-large and fine-tuned for open-vocabulary detection in graphical user-interface (GUI) screenshots. Given an image and a free-form textual description, the model returns either

- the bounding box of the best-matching element, or
- a polygon mask when a bounding box is unavailable.

The model is intended for research on lightweight GUI agents, automated testing, and accessibility tools where a small footprint is preferred over its larger counterpart.
## Quick-start
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# --- load pipeline -----------------------------------------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "GUIrilla/GUIrilla-See-0.7B"  # 0.7B parameters
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# --- inference ---------------------------------------------------------------
image = Image.open("screenshot.png").convert("RGB")
task_prompt = "<OPEN_VOCABULARY_DETECTION>"
text_query = 'button with the label "Submit"'
prompt = task_prompt + text_query

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, dtype)

with torch.no_grad():
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
        early_stopping=False,
    )

decoded = processor.batch_decode(ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    decoded, task=task_prompt, image_size=image.size
)["<OPEN_VOCABULARY_DETECTION>"]
```
## Training Data
The model was trained on the GUIrilla-Task dataset.

- Training split: 25,606 tasks across 881 macOS applications (10% of applications held out for validation)
- Test split: 1,565 tasks across 227 macOS applications
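If the dataset is published on the Hugging Face Hub, it can be pulled with the `datasets` library. The Hub ID below is an assumption that mirrors the org/model naming, and the record fields depend on the released schema:

```python
from datasets import load_dataset

# Hub ID assumed to mirror the org naming; adjust to the actual dataset ID.
ds = load_dataset("GUIrilla/GUIrilla-Task")

print(ds)              # expected splits: train / test
print(ds["train"][0])  # one task: screenshot, element description, target geometry
```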
## Training Procedure
- 4 epochs of LoRA fine-tuning on a single A100 40 GB GPU.
- Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), LR = 5e-6, warm-up ratio 0.01.
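A minimal sketch of a matching LoRA setup with `peft` and `transformers`. Only the epochs, optimiser betas, learning rate, and warm-up ratio come from the description above; the adapter rank, alpha, target modules, and batch size are illustrative placeholders:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Adapter shape is a placeholder; match target_modules to the checkpoint's
# actual attention projection names before training.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "k_proj", "v_proj"])
model = get_peft_model(base, lora)

# Reported hyperparameters: 4 epochs, AdamW(β₁=0.9, β₂=0.95),
# LR 5e-6, warm-up ratio 0.01, single A100 40 GB.
args = TrainingArguments(
    output_dir="guirilla-see-0.7b-lora",
    num_train_epochs=4,
    learning_rate=5e-6,
    warmup_ratio=0.01,
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,
    per_device_train_batch_size=1,  # placeholder; batch size was not reported
)
```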
## Evaluation
| Split | Success Rate (%) |
|-------|------------------|
| Test  | 53.55            |
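Success rate is the percentage of test tasks the model grounds correctly. The exact criterion is not spelled out in this card; a common GUI-grounding convention, assumed here for illustration, counts a prediction as successful when the predicted click point lands inside the ground-truth element box:

```python
def is_success(pred_point, gt_box):
    # Assumed criterion: predicted click point inside the ground-truth box.
    x, y = pred_point
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

def success_rate(preds, gt_boxes):
    hits = sum(is_success(p, b) for p, b in zip(preds, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```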
## Ethical & Safety Notes
- Always sandbox the model or require confirmation steps before connecting it to real GUIs.
- Screenshots may reveal sensitive data; ensure compliance with applicable privacy regulations.
## License

MIT (see `LICENSE`).
## Model tree

Base model: [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)