---
license: mit
datasets:
- OS-Copilot/OS-Atlas-data
base_model:
- ByteDance-Seed/UI-TARS-2B-SFT
---

# GUI-Actor-Verifier-2B

This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://aka.ms/GUI-Actor).
It is built on [UI-TARS-2B-SFT](https://huggingface.co/ByteDance-Seed/UI-TARS-2B-SFT) and predicts whether a proposed action position is correct for a given language instruction. It pairs well with **GUI-Actor**, whose attention map provides diverse candidate positions for verification from only a single inference pass.

For more details on model design and evaluation, please check: [🏠 Project Page](https://aka.ms/GUI-Actor) | [💻 GitHub Repo](https://github.com/microsoft/GUI-Actor) | [📑 Paper]().

| Model List | Hugging Face Link |
|--------------------------------------------|--------------------------------------------|
| **GUI-Actor-7B-Qwen2-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL) |
| **GUI-Actor-2B-Qwen2-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL) |
| **GUI-Actor-7B-Qwen2.5-VL (coming soon)** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL) |
| **GUI-Actor-3B-Qwen2.5-VL (coming soon)** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL) |
| **GUI-Actor-Verifier-2B** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B) |

## 📊 Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|------------------|--------------|----------------|------------|----------------|
| **_72B models:_** | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | **89.4** | - |
| UI-TARS-72B | Qwen2-VL | **38.1** | 88.4 | **90.3** |
| **_7B models:_** | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | **91.6** |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | **44.2** | **89.7** | 90.9 |
| **_2B models:_** | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | **41.8** | **86.9** | **89.3** |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with **Qwen2.5-VL** as the backbone.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|----------------|---------------|----------------|----------------|
| **_7B models:_** | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | **47.7** | **92.5** |
| **_3B models:_** | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | **45.9** | **92.4** |

## 🚀 Usage
The verifier takes as input a language instruction and an image with a red circle marking the proposed target position; an example is shown below. It outputs either 'True' or 'False', and you can also use the probability of each label to score the sample (see the scoring sketch after the code block). The example additionally requires the `qwen-vl-utils` package (`pip install qwen-vl-utils`).

For more detailed usage, please refer to our [GitHub repo](https://github.com/microsoft/GUI-Actor).

<img src="https://cdn-uploads.huggingface.co/production/uploads/64d45451c34a346181b130dd/1LTBORYJsO9Ru6B4q_SKl.png" alt="image" width="500"/>

```python
import torch
import re
import numpy as np
from PIL import Image, ImageDraw
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info


# Load the model, tokenizer, and processor.
model_name_or_path = "microsoft/GUI-Actor-Verifier-2B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name_or_path,
    device_map="cuda:0",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).eval()
output_len = 1  # the answer is a single token: "True" or "False"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name_or_path)

def draw_annotations(img, point_in_pixel, bbox, color='red', size=1):
    draw = ImageDraw.Draw(img)

    # Draw the ground-truth bounding box in yellow, assuming [x1, y1, x2, y2] format.
    if bbox:
        draw.rectangle(bbox, outline="yellow", width=4)

    # Draw a hollow circle around the candidate point, scaled with the image size.
    if point_in_pixel:
        radius = np.ceil(8 * size).astype(int)
        circle_bbox = [
            point_in_pixel[0] - radius,  # x1
            point_in_pixel[1] - radius,  # y1
            point_in_pixel[0] + radius,  # x2
            point_in_pixel[1] + radius   # y2
        ]
        draw.ellipse(circle_bbox, outline=color, width=np.ceil(4 * size).astype(int))

    return img

def ground_only_positive(model, tokenizer, processor, instruction, image, point):
    # Accept either a file path or a PIL image.
    if isinstance(image, str):
        image = Image.open(image)

    width, height = image.size
    image = draw_annotations(image, point, None, size=height / 1000 * 1.2)

    # Verification prompt, kept verbatim as released with the model.
    prompt_origin = "Please observe the screenshot and exame whether the hollow red circle accurately placed on the intended position in the image: '{}'. Answer True or False."
    full_prompt = prompt_origin.format(instruction)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": full_prompt},
            ],
        }
    ]

    # Preparation for inference.
    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text_input],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda:0")

    # Greedy decoding of the single answer token.
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=output_len,
        do_sample=False,
    )

    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    matches = re.findall(r'\b(?:True|False)\b', response)
    if not matches:
        answer = 'Error Format'
    else:
        answer = matches[-1]
    return answer

# Given an image, an instruction, and a candidate coordinate:
instruction = 'close this window'
image = Image.open('test.png')
width, height = image.size
point = [int(0.9709 * width), int(0.1548 * height)]  # the point should be in pixels
answer = ground_only_positive(model, tokenizer, processor, instruction, image, point)  # 'True' or 'False'
```

## 📝 Citation
```
@article{wu2025guiactor,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={},
}
```