---
license: llama2
pipeline_tag: image-text-to-text
---
# UGround (The Initial LLaVA-based Version)
**Update: We have trained [stronger models](https://huggingface.co/osunlp/UGround-V1-7B) based on Qwen2-VL with the same data. We recommend using them instead: they offer better performance and are more convenient for training, inference, and deployment.**
UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between [OSUNLP](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).
![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)
- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:[email protected])
## Models
- Initial UGround-V1: https://huggingface.co/osunlp/UGround
- UGround-V1-2B (Qwen2-VL): https://huggingface.co/osunlp/UGround-V1-2B
- UGround-V1-7B (Qwen2-VL): https://huggingface.co/osunlp/UGround-V1-7B
- UGround-V1-72B (Qwen2-VL): Coming Soon
- UGround-V1.1-2B (Qwen2-VL): Coming Soon
- UGround-V1.1-7B (Qwen2-VL): Coming Soon
- UGround-V1.1-72B (Qwen2-VL): Coming Soon
## Release Plan
- [x] Model Weights
  - [x] Initial V1 (the one used in the paper)
  - [x] Qwen2-VL-based V1
    - [x] 2B
    - [x] 7B
    - [ ] 72B
  - [ ] V1.1
- [ ] Code
  - [x] Inference Code of UGround
  - [x] Offline Experiments
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniACT
    - [ ] AndroidControl
  - [ ] Online Experiments
    - [ ] Mind2Web-Live-SeeAct-V
    - [ ] AndroidWorld-SeeAct-V
- [ ] Data-V1
  - [ ] Data Examples
  - [ ] Data Construction Scripts
  - [ ] Guidance for Open-source Data
- [ ] Data-V1.1
- [x] Online Demo (HF Spaces)
## Main Results
### GUI Visual Grounding: ScreenSpot (Standard Setting)
| Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ---------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| **UGround-V1** | LLaVA-UGround-V1 | UGround-V1 | **82.8** | **60.3** | **82.5** | **63.6** | **80.4** | **70.4** | **73.3** |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | **95.6** | 77.7 | **93.8** | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | **90.9** | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 93.0 | **79.9** | **93.8** | **76.4** | **90.9** | **84.0** | **86.3** |
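The Avg column appears to be the unweighted mean of the six subscores. For example, the UGround-V1-7B (Qwen2-VL) row can be checked as:

```python
# ScreenSpot subscores for UGround-V1-7B (Qwen2-VL), from the table above:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon
scores = [93.0, 79.9, 93.8, 76.4, 90.9, 84.0]

# Macro average over the six splits, rounded to one decimal place
avg = round(sum(scores) / len(scores), 1)
print(avg)  # 86.3
```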
### GUI Visual Grounding: ScreenSpot (Agent Setting)
| Planner | Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | **UGround-V1** | LLaVA-UGround-V1 | UGround-V1 | **93.4** | **76.9** | **92.8** | **67.9** | **88.7** | **68.9** | **81.4** |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | **94.1** | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | **79.9** | 90.2 | 66.4 | **92.6** | **79.1** | 83.7 |
| GPT-4o | **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | **94.1** | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | **94.1** | **79.9** | **93.3** | **73.6** | 89.6 | 73.3 | **84.0** |
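In the agent setting, a planner (e.g., GPT-4o) produces a referring expression and the grounding model returns a click point on the screenshot. A minimal post-processing sketch, assuming the model emits a textual point such as `(500, 250)` in a 0–1000 normalized coordinate space (the exact output convention differs between checkpoints; see the repository's inference code for the authoritative version):

```python
import re

def parse_click_point(text: str, width: int, height: int,
                      norm_range: int = 1000) -> tuple[int, int]:
    """Parse an '(x, y)' point from model output and map it to pixel
    coordinates, assuming coordinates normalized to [0, norm_range]."""
    m = re.search(r"\((\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\)", text)
    if m is None:
        raise ValueError(f"no point found in: {text!r}")
    x, y = float(m.group(1)), float(m.group(2))
    return round(x / norm_range * width), round(y / norm_range * height)

# Example: a 1920x1080 screenshot, model output "(500, 250)"
print(parse_click_point("(500, 250)", 1920, 1080))  # (960, 270)
```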
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/u5bXFxxAWCXthyXWyZkM4.png)
## Citation Information
If you find this work useful, please consider citing our papers:
```
@article{gou2024uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2410.05243},
year={2024},
url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2401.01614},
year={2024},
}
```