SpiritSight Agent: Advanced GUI Agent with One Look

📄 Paper • 🤖 Models • 📚 Datasets (Coming soon…)

Introduction

SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms.

Models

We recommend fine-tuning the base model on custom data.

| Model | Checkpoint | Size | License |
|---|---|---|---|
| SpiritSight-Agent-2B-base | 🤗 HF Link | 2B | InternVL |
| SpiritSight-Agent-8B-base | 🤗 HF Link | 8B | InternVL |
| SpiritSight-Agent-26B-base | 🤗 HF Link | 26B | InternVL |

Datasets

Coming soon.

Inference

conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent

pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation

python infer_SSAgent-26B.py
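Before launching the inference script, it can help to confirm that the active environment matches the pins given above (Python 3.9 and flash-attn 2.3.6). The following is a minimal sketch of such a check, not part of the repository: the `preflight` function and its parameters are hypothetical names introduced here for illustration, and it only inspects installed package metadata rather than loading any model.

```python
import sys
from importlib import metadata


def preflight(expected_python=(3, 9), pinned=None, version_info=None):
    """Return a list of human-readable problems; an empty list means OK.

    `pinned` maps distribution names to the exact versions expected.
    The default reflects the install commands in this section; the
    function itself is a hypothetical helper, not shipped with the repo.
    """
    if pinned is None:
        pinned = {"flash-attn": "2.3.6"}
    if version_info is None:
        version_info = sys.version_info

    problems = []

    # Compare only major.minor, since the env is created with python=3.9.
    major_minor = tuple(version_info[:2])
    if major_minor != tuple(expected_python):
        problems.append(
            "Python %d.%d found, expected %d.%d"
            % (major_minor + tuple(expected_python))
        )

    # Look up each pinned distribution in the installed package metadata.
    for dist, wanted in pinned.items():
        try:
            found = metadata.version(dist)
        except metadata.PackageNotFoundError:
            problems.append(f"{dist} is not installed (expected {wanted})")
            continue
        if found != wanted:
            problems.append(f"{dist}=={found} found, expected {wanted}")

    return problems


if __name__ == "__main__":
    for line in preflight():
        print("WARNING:", line)
```

Running this inside the activated environment prints one warning per mismatch and nothing when the environment matches the pins.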

Citation

If you find this repo useful for your research, please kindly cite our paper:

@misc{huang2025spiritsightagentadvancedgui,
      title={SpiritSight Agent: Advanced GUI Agent with One Look}, 
      author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
      year={2025},
      eprint={2503.03196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03196},
}

Acknowledgments

We thank the following amazing projects that truly inspired us:
