base_model:
  - InternVL/InternVL2-26B
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text

SpiritSight Agent: Advanced GUI Agent with One Look

📄 Paper • 🤖 Models • 🌐 Project Page • 📚 Datasets

Introduction

SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms.

Models

We recommend fine-tuning the base model on custom data.

| Model | Checkpoint | Size | License |
| --- | --- | --- | --- |
| SpiritSight-Agent-2B-base | 🤗 HF Link | 2B | InternVL |
| SpiritSight-Agent-8B-base | 🤗 HF Link | 8B | InternVL |
| SpiritSight-Agent-26B-base | 🤗 HF Link | 26B | InternVL |

Datasets

Coming soon.

Inference

conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent

pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation
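Before running inference, it can help to confirm that the environment matches the setup commands above. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_environment` are hypothetical, and the assumption that `torch` and `transformers` are among the packages listed in `requirements.txt` is ours. Only the `flash-attn` pin comes from the install command.

```python
import sys
from importlib import metadata
from typing import Dict, Optional

# The flash-attn pin is taken from the install command above.
# transformers and torch are ASSUMED dependencies; they are only checked for presence.
EXPECTED: Dict[str, Optional[str]] = {
    "flash-attn": "2.3.6",
    "transformers": None,
    "torch": None,
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected: Dict[str, Optional[str]] = EXPECTED) -> Dict[str, str]:
    """Compare installed packages against the expected pins and report a status per package."""
    report: Dict[str, str] = {}
    for package, pin in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif pin is not None and found != pin:
            report[package] = f"found {found}, expected {pin}"
        else:
            report[package] = f"ok ({found})"
    return report


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Running the script inside the activated `spiritsight-agent` environment prints one status line per package, which makes a mismatched `flash-attn` build easy to spot before the model is loaded.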

python infer_SSAgent-26B.py
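The script above is the reference entry point. For readers who want to load a checkpoint from their own code, the sketch below shows one plausible way, assuming the SpiritSight checkpoints follow the same remote-code loading convention as their InternVL2 base models. The function name `load_checkpoint`, the `LOAD_KWARGS` constant, and the `model_path` argument are hypothetical, and nothing here is taken from `infer_SSAgent-26B.py`.

```python
from typing import Any, Dict, Tuple

# Keyword arguments commonly used when loading InternVL2-style checkpoints.
# ASSUMPTION: the SpiritSight checkpoints ship custom modeling code, so
# trust_remote_code is needed, as it is for the InternVL2 base models.
LOAD_KWARGS: Dict[str, Any] = {
    "low_cpu_mem_usage": True,
    "trust_remote_code": True,
}


def load_checkpoint(model_path: str) -> Tuple[Any, Any]:
    """Load a model and tokenizer from a local directory or Hub repo id.

    Heavy imports are deferred so this module can be imported without
    torch or transformers installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
        **LOAD_KWARGS,
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True,
        use_fast=False,
    )
    return model, tokenizer
```

`model_path` would be whichever checkpoint from the table above has been downloaded. Prompt construction and image preprocessing are left to the repository's inference script, since the expected input format is not described here.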

Citation

If you find this repo useful for your research, please cite our paper:

@misc{huang2025spiritsightagentadvancedgui,
      title={SpiritSight Agent: Advanced GUI Agent with One Look}, 
      author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
      year={2025},
      eprint={2503.03196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03196},
}

Acknowledgments

We thank the following amazing projects that truly inspired us: