GroundNext-7B-V0
🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Model
Highlights
GroundNext-7B-V0 is a state-of-the-art vision-language model for GUI element grounding, developed as part of the GroundCUA project. This model features:
- Superior grounding accuracy achieving 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision benchmarks
- Exceptional cross-platform generalization with 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training
- Data-efficient training achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work
- Strong agentic capabilities reaching 50.6% overall success rate on OSWorld when paired with reasoning models
- Native tool-calling support with built-in computer use action space for mouse, keyboard, and screen interactions
Model Overview
GroundNext-7B-V0 has the following characteristics:
- Type: Vision-Language Model for GUI Grounding
- Base Model: Qwen2.5-VL-7B-Instruct
- Training Approach: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
- Number of Parameters: 7.0B
- Training Data: 700K human-annotated desktop demonstrations from GroundCUA dataset
- Context Length: 262,144 tokens (inherited from base model)
- Specialization: Desktop GUI element grounding with cross-platform generalization
For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our paper, GitHub repository, and project website.
Performance
Desktop Grounding Benchmarks
| Benchmark | Qwen2.5-VL-7B | UI-TARS-72B | GroundNext-7B-V0 |
|---|---|---|---|
| ScreenSpot-Pro | 29.7 | 38.1 | 52.9 |
| OSWorld-G | 42.7 | 57.1 | 67.7 |
| UI-Vision | 16.5 | 25.5 | 60.3 |
| Avg (Desktop) | 29.6 | 40.2 | 60.3 |
Cross-Platform Generalization (Desktop, Mobile & Web)
| Benchmark | Qwen2.5-VL-7B | UI-TARS-72B | GroundNext-7B-V0 |
|---|---|---|---|
| MMBench-GUI | 33.9 | 74.3 | 81.1 |
| ScreenSpot-v2 | 88.8 | 90.3 | 90.4 |
| Avg (Mobile/Web) | 61.4 | 82.3 | 85.8 |
Agentic Performance on OSWorld
When paired with OpenAI o3 for reasoning, GroundNext models demonstrate strong end-to-end computer use capabilities on OSWorld:
| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |
Note: GroundNext-7B-V0 results with o3 integration are forthcoming; the table above reports GroundNext-3B.
Quickstart
GroundNext-7B-V0 is compatible with the latest Hugging Face transformers library and follows the Qwen2.5-VL implementation.
With older versions of transformers you may encounter compatibility issues (the Qwen2.5-VL model classes were added in transformers 4.49.0). We recommend using transformers>=4.49.0.
Installation
pip install "transformers>=4.49.0" torch torchvision accelerate
pip install qwen-vl-utils # For image processing utilities
Basic Inference
The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua  # prompt and image-preparation helpers from the GroundCUA GitHub repository
import io
from urllib.request import urlopen
model_name = "ServiceNow/GroundNext-7B-V0"
# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True
# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)
# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
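The groundcua helpers above come from our GitHub repository and provide the exact prompts and image preparation used in training. Purely to illustrate the message structure (a system prompt carrying the screen dimensions and tool signature, followed by an image-plus-text user turn), a hand-rolled sketch might look like the following; the prompt wording and the build_messages name are hypothetical stand-ins, so prefer groundcua.create_messages for faithful results.
# Hypothetical stand-in for groundcua.create_messages (illustration only);
# the exact system prompt used during training is provided by the groundcua helpers.
def build_messages(instruction, image, width, height):
    system_prompt = (
        f"You are a GUI grounding assistant. The screen resolution is {width}x{height}. "
        'Respond with a computer_use tool call such as '
        '<tool_call>{"name": "computer_use", "arguments": '
        '{"action": "left_click", "coordinate": [x, y]}}</tool_call>'
    )
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": instruction},
            ],
        },
    ]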
Deployment with vLLM
For production deployment, you can use vLLM to create an OpenAI-compatible API endpoint:
vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192
Note: Adjust --max-model-len based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
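Once the server is running, any OpenAI-compatible client can query it. The snippet below is a sketch assuming the default vLLM endpoint at http://localhost:8000/v1 and a local screenshot.png; the file name and instruction are placeholders, and the system prompt from the GitHub helpers can be added as a system message if desired.
import base64
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key value is arbitrary unless --api-key is set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the screenshot as a base64 data URL
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ServiceNow/GroundNext-7B-V0",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click on the 'File' button"},
            ],
        }
    ],
    temperature=0.0,
    max_tokens=128,
)
print(response.choices[0].message.content)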
Best Practices
To achieve optimal grounding performance, we recommend:
Image Preprocessing:
- Use high-resolution screenshots (minimum 800x600)
- Ensure UI elements are clearly visible
- Maintain original aspect ratios when resizing
Prompt Engineering:
- Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save")
- Include element attributes when available (color, position, text)
Generation Parameters:
- Use temperature=0.0 for deterministic grounding
- Set max_new_tokens=128 (sufficient for tool calls)
- Enable use_cache=True for faster inference
System Prompt:
- Always include the system prompt with the actual screen dimensions
- Replace {width} and {height} with the true screenshot dimensions
- Maintain the tool signature format for proper JSON parsing
Post-processing:
- Parse <tool_call> tags to extract the JSON payload (see the sketch below)
- Validate that coordinates are within screen bounds
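As a minimal sketch of the post-processing step, assuming the single-tool-call response format shown in the Quickstart (the parse_tool_call helper name is introduced here for illustration):
import json
import re

def parse_tool_call(response: str, width: int, height: int):
    # Extract the JSON payload between <tool_call> ... </tool_call>
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError("No <tool_call> tag found in model output")
    call = json.loads(match.group(1))
    action = call["arguments"]["action"]
    x, y = call["arguments"]["coordinate"]
    # Validate that the predicted coordinate lies within the screenshot bounds
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"Coordinate ({x}, {y}) is outside the {width}x{height} screen")
    return action, (x, y)

# Example usage: action, (x, y) = parse_tool_call(response, width, height)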
Training
GroundNext-7B-V0 was trained using a two-stage approach:
- Supervised Fine-tuning (SFT): Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
- Reinforcement Learning (RLOO): Further optimized using reward-based learning with custom GUI grounding rewards
For detailed training instructions, dataset preparation, and reproduction steps, please visit our GitHub repository.
Limitations and Future Work
- Desktop-focused: Primarily trained on desktop environments (though shows strong cross-platform generalization)
- Action space: Currently supports only mouse-click actions
- Languages: Optimized for English UI elements
- Resolution: Performance may vary with extremely high or low resolution images
Citation
If you use GroundNext-7B-V0 in your research, please cite:
@misc{feizi2025groundingcomputeruseagents,
title={Grounding Computer Use Agents on Human Demonstrations},
author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2511.07332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.07332},
}
License
This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the LICENSE file for details.
Acknowledgements
We thank:
- The Qwen team for the excellent Qwen2.5-VL foundation models
- The open-source community for tools and frameworks that made this work possible
- Human annotators who contributed to the GroundCUA dataset