This repository contains the GUI-G2-7B model from the paper GUI-G²: Gaussian Reward Modeling for GUI Grounding. More inference details are provided in the quick start section of the GitHub repository.
The model is based on Qwen2.5-VL-7B-Instruct and is fine-tuned with our proposed Gaussian dense reward framework.
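As a rough illustration of the idea behind the framework, the sketch below scores a predicted click with a Gaussian centered on the target element, with the spread tied to the element's size. The function name, the `sigma_scale` factor, and the omission of the coverage term are simplifying assumptions for illustration only; the exact reward is defined in the paper.

```python
import math

def gaussian_point_reward(pred_x, pred_y, box, sigma_scale=0.5):
    """Toy dense reward: a Gaussian centered on the target element.

    `box` is (x1, y1, x2, y2) in pixels. The spread is tied to the element
    size via `sigma_scale`; the actual GUI-G^2 reward (including its
    coverage term) is specified in the paper, not reproduced here.
    """
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    width, height = box[2] - box[0], box[3] - box[1]
    sigma_x = max(width * sigma_scale, 1e-6)
    sigma_y = max(height * sigma_scale, 1e-6)
    return math.exp(-((pred_x - cx) ** 2) / (2 * sigma_x ** 2)
                    - ((pred_y - cy) ** 2) / (2 * sigma_y ** 2))

# A click near the center of a 100x40 button scores close to 1.0;
# the reward decays smoothly rather than dropping to 0 as the click moves away.
print(gaussian_point_reward(148, 62, (100, 40, 200, 80)))  # ~0.99
```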
First, install the required dependencies:
pip install transformers==4.49.0 qwen-vl-utils
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "inclusionAI/GUI-G2-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("inclusionAI/GUI-G2-7B")
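# Optionally bound the number of visual tokens per image with min_pixels / max_pixels
# (standard Qwen2.5-VL processor arguments; the values below are illustrative only):
# processor = AutoProcessor.from_pretrained(
#     "inclusionAI/GUI-G2-7B", min_pixels=256*28*28, max_pixels=1280*28*28)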
image_path = ''   # path to your GUI screenshot
instruction = ''  # natural-language description of the element to locate
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": instruction},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
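The generated text contains the predicted location of the target element. The exact output format depends on the checkpoint and prompt (see the GitHub quick start), so the snippet below is only a sketch: it assumes the response contains a pixel coordinate pair such as `(x, y)`, extracts the first two numbers, and clamps them to the screenshot bounds. `parse_click_point` is a hypothetical helper, not part of the released code.

```python
import re
from PIL import Image

def parse_click_point(response: str, image_file: str):
    """Sketch: pull the first two numbers out of the model response and clamp
    them to the screenshot size. Adapt the parsing to the actual output format
    documented in the GUI-G^2 quick start."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if len(numbers) < 2:
        return None
    x, y = float(numbers[0]), float(numbers[1])
    width, height = Image.open(image_file).size
    return min(max(x, 0.0), width - 1), min(max(y, 0.0), height - 1)

print(parse_click_point(output_text[0], image_path))
```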
Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
---|---|---|---|---|---|---|---|
GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
Qwen2.5-VL-3B | 93.4 | 73.5 | 88.1 | 58.6 | 88.0 | 71.4 | 80.9 |
Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
SeeClick-9.6B | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 |
UGround-7B | 75.1 | 84.5 | 85.1 | 61.4 | 84.6 | 71.9 | 76.3 |
OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 |
UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
JEDI-7B | 96.9 | 87.2 | 95.9 | 87.9 | 94.4 | 84.2 | 91.7 |
GUI-Actor-7B | 97.6 | 88.2 | 96.9 | 85.7 | 93.2 | 86.7 | 92.1 |
UI-R1-3B | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 |
UI-R1-E-3B | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 |
SE-GUI-7B | - | - | - | - | - | - | 90.3 |
LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
GUI-G²-7B (Ours) | 98.3 | 91.9 | 95.4 | 89.3 | 94.0 | 87.7 | 93.3 |

The RL training code is built on the VLM-R1 project.

If you use GUI-G², please cite our work:
@misc{tang2025guig2gaussianrewardmodeling,
title={GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding},
author={Fei Tang and Zhangxuan Gu and Zhengxi Lu and Xuyang Liu and Shuheng Shen and Changhua Meng and Wen Wang and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
year={2025},
eprint={2507.15846},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.15846},
}