
GUI-Actor-Verifier-2B

This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. Built on UI-TARS-2B-SFT, it predicts whether a proposed action position is correct for a given language instruction. It pairs well with GUI-Actor, whose attention map provides diverse candidate positions for verification from a single inference pass.

For more details on model design and evaluation, please check: 🏠 Project Page | 💻 GitHub Repo | 📑 Paper.
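
As noted above, GUI-Actor's attention map yields several candidate positions from a single forward pass, and the verifier is used to pick the most plausible one. The outline below is a minimal, illustrative sketch of that selection step; select_best_point and score_fn are hypothetical names rather than part of the released code, and a concrete way to obtain a score from this verifier is shown at the end of the Usage section.

from typing import Callable, List, Tuple

def select_best_point(candidates: List[Tuple[int, int]],
                      score_fn: Callable[[Tuple[int, int]], float]) -> Tuple[int, int]:
    # Score each candidate (x, y) pixel position with the verifier and
    # return the point the verifier is most confident about.
    scored = [(score_fn(point), point) for point in candidates]
    best_score, best_point = max(scored, key=lambda item: item[0])
    return best_point

In practice, score_fn would wrap a call to this verifier and return, for example, the probability of the 'True' label for that candidate.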

| Model List | Hugging Face Link |
|---|---|
| GUI-Actor-7B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-2B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-7B-Qwen2.5-VL (coming soon) | 🤗 Hugging Face |
| GUI-Actor-3B-Qwen2.5-VL (coming soon) | 🤗 Hugging Face |
| GUI-Actor-Verifier-2B | 🤗 Hugging Face |

📊 Performance Comparison on GUI Grounding Benchmarks

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|---|---|---|---|---|
| 72B models: | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| 7B models: | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| 2B models: | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|---|---|---|---|
| 7B models: | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| 3B models: | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |

🚀 Usage

The verifier takes as input a language instruction and an image with a red circle marking the proposed target position; an example is shown below. It outputs either 'True' or 'False', and the probability of each label can also be used to score the sample.

For more detailed usage, please refer to our GitHub repo.

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
import re
import os
import numpy as np
from PIL import Image, ImageDraw
from qwen_vl_utils import process_vision_info



# load model
model_name_or_path = "microsoft/GUI-Actor-Verifier-2B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_name_or_path, 
            device_map="cuda:0", 
            trust_remote_code=True, 
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2"
        ).eval()
output_len = 1  # the verifier only needs to generate a single token ("True" or "False")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name_or_path)

def draw_annotations(img, point_in_pixel, bbox, output_path='test.png', color='red', size=1):
    draw = ImageDraw.Draw(img)
    
    # Draw the ground-truth bounding box (if provided) in yellow
    if bbox:
        # bbox format is [x1, y1, x2, y2]
        draw.rectangle(bbox, outline="yellow", width=4)

    # Draw a hollow circle around the predicted point
    if point_in_pixel:
        # Bounding box of the circle; the radius scales with the image size
        radius = np.ceil(8 * size).astype(int)
        circle_bbox = [
            point_in_pixel[0] - radius,  # x1
            point_in_pixel[1] - radius,  # y1
            point_in_pixel[0] + radius,  # x2
            point_in_pixel[1] + radius   # y2
        ]
        draw.ellipse(circle_bbox, outline=color, width=np.ceil(4 * size).astype(int))
    
    return img

def ground_only_positive(model, tokenizer, processor, instruction, image, point):
  # Accept either an image path or a PIL.Image
  if isinstance(image, str):
      assert os.path.exists(image) and os.path.isfile(image), "Invalid input image path."
      image = Image.open(image)

  width, height = image.size
  # Mark the candidate point with a hollow red circle, scaled to the image size
  image = draw_annotations(image, point, None, output_path=None, size=height/1000 * 1.2)

  prompt_origin = "Please observe the screenshot and exame whether the hollow red circle accurately placed on the intended position in the image: '{}'. Answer True or False."
  full_prompt = prompt_origin.format(instruction)

  messages = [
      {
          "role": "user",
          "content": [
              {
                  "type": "image",
                  "image": image,
              },
              {"type": "text", "text": full_prompt},
          ],
      }
  ]
  # Preparation for inference
  text_input = processor.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
  inputs = processor(
      text=[text_input],
      images=image_inputs,
      videos=video_inputs,
      padding=True,
      return_tensors="pt",
  )
  inputs = inputs.to("cuda:0")

  # Greedy decoding of a single token ("True" or "False")
  generated_ids = model.generate(
      **inputs,
      max_new_tokens=output_len,
      do_sample=False,
  )

  generated_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
  response = processor.batch_decode(
      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
  )[0]

  print(response)
  matches = re.findall(r'\b(?:True|False)\b', response)
  if not len(matches):
      answer = 'Error Format'
  else:
      answer = matches[-1]
  return answer

# given the image path, instruction, and coordinate
instruction = 'close this window'
image = Image.open('test.png')
width, height = image.size
point = [int(0.9709 * width), int(0.1548 * height)]  # the point should be in pixels
answer = ground_only_positive(model, tokenizer, processor, instruction, image, point)  # outputs 'True' or 'False'
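
The function above keeps only the decoded 'True'/'False' answer. To obtain a confidence score instead, e.g. for ranking several candidate points as described earlier, one option is to read the probabilities of the 'True' and 'False' tokens at the first generation step. The snippet below is a minimal sketch of that idea rather than part of the original release; verifier_score is a hypothetical helper, and it assumes that 'True' and 'False' each begin with a single tokenizer token.

import torch.nn.functional as F

def verifier_score(model, processor, inputs):
    # Hypothetical helper: returns a normalized P("True") for inputs that were
    # already prepared with the processor, as in ground_only_positive above.
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            do_sample=False,
            output_scores=True,
            return_dict_in_generate=True,
        )
    logits = out.scores[0][0]  # vocabulary logits for the first generated token
    probs = F.softmax(logits.float(), dim=-1)
    true_id = processor.tokenizer.encode("True", add_special_tokens=False)[0]
    false_id = processor.tokenizer.encode("False", add_special_tokens=False)[0]
    p_true, p_false = probs[true_id].item(), probs[false_id].item()
    # Normalize over the two labels so scores are comparable across samples
    return p_true / (p_true + p_false + 1e-9)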

πŸ“ Citation

@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents}, 
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}