---
license: llama2
pipeline_tag: image-text-to-text
---

# UGround (The Initial LLaVA-based Version)

**Update: We have trained [stronger models](https://huggingface.co/osunlp/UGround-V1-7B) based on Qwen2-VL with the same data. We suggest using them instead for better performance and more convenient training, inference and deployment.**

UGround is a strong GUI visual grounding model trained with a simple recipe. See the homepage and paper linked below for details. This work is a collaboration between [OSUNLP](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).

![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)

- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** Boyu Gou


## Models

- Initial UGround-V1: https://huggingface.co/osunlp/UGround
- UGround-V1-2B (Qwen2-VL): https://huggingface.co/osunlp/UGround-V1-2B
- UGround-V1-7B (Qwen2-VL): https://huggingface.co/osunlp/UGround-V1-7B
- UGround-V1-72B (Qwen2-VL): Coming Soon
- UGround-V1.1-2B (Qwen2-VL): Coming Soon
- UGround-V1.1-7B (Qwen2-VL): Coming Soon
- UGround-V1.1-72B (Qwen2-VL): Coming Soon

## Release Plan

- [x] Model Weights
  - [x] Initial V1 (the one used in the paper)
  - [x] Qwen2-VL-based V1
    - [x] 2B
    - [x] 7B
    - [ ] 72B
  - [ ] V1.1
- [ ] Code
  - [x] Inference Code of UGround
  - [x] Offline Experiments
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniAct
    - [ ] Android Control
  - [ ] Online Experiments
    - [ ] Mind2Web-Live-SeeAct-V
    - [ ] AndroidWorld-SeeAct-V
- [ ] Data-V1
  - [ ] Data Examples
  - [ ] Data Construction Scripts
  - [ ] Guidance of Open-source Data
- [ ] Data-V1.1
- [x] Online Demo (HF Spaces)



## Main Results

### GUI Visual Grounding: ScreenSpot (Standard Setting)

| Grounding Model       | Arch             | SFT data         | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg      |
| ---------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4                        |                  |                  | 22.6        | 24.5        | 20.2         | 11.8         | 9.2      | 8.8      | 16.2     |
| GPT-4o                       |                  |                  | 20.2        | 24.9        | 21.1         | 23.6         | 12.2     | 7.8      | 18.3     |
| MiniGPT-v2                   | MiniGPT-v2       |                  | 8.4         | 6.6         | 6.2          | 2.9          | 6.5      | 3.4      | 5.7      |
| Groma                        | Groma            |                  | 10.3        | 2.6         | 4.6          | 4.3          | 5.7      | 3.4      | 5.2      |
| Fuyu                         | Fuyu             |                  | 41.0        | 1.3         | 33.0         | 3.6          | 33.9     | 4.4      | 19.5     |
| Qwen-VL                      | Qwen-VL          |                  | 9.5         | 4.8         | 5.7          | 5.0          | 3.5      | 2.4      | 5.2      |
| SeeClick                     | Qwen-VL          | SeeClick         | 78.0        | 52.0        | 72.2         | 30.0         | 55.7     | 32.5     | 53.4     |
| Qwen-GUI                     | Qwen-VL          | GUICourse        | 52.4        | 10.9        | 45.9         | 5.7          | 43.0     | 13.6     | 28.6     |
| **UGround-V1**               | LLaVA-UGround-V1 | UGround-V1       | **82.8**        | **60.3**        | **82.5**         | **63.6**         | **80.4**     | **70.4**     | **73.3**     |
| Qwen2-VL                     | Qwen2-VL         |                  | 61.3        | 39.3        | 52.0         | 45.0         | 33.0     | 21.8     | 42.1     |
| Aguvis-G-7B                  | Qwen2-VL         | Aguvis-Stage-1   | 88.3        | 78.2        | 88.1         | 70.7         | 85.7     | 74.8     | 81.0     |
| Aguvis-7B                    | Qwen2-VL         | Aguvis-Stage-1&2 | **95.6**    | 77.7        | **93.8**     | 67.1         | 88.3     | 75.2     | 83.0     |
| OS-Atlas-Base-4B             | InternVL         | OS-Atlas         | 85.7        | 58.5        | 72.2         | 45.7         | 82.6     | 63.1     | 68.0     |
| OS-Atlas-Base-7B             | Qwen2-VL         | OS-Atlas         | 93.0        | 72.9        | 91.8         | 62.9         | **90.9** | 74.3     | 81.0     |
| ShowUI-G                     | ShowUI           | ShowUI           | 91.6        | 69.0        | 81.8         | 59.0         | 83.0     | 65.5     | 75.0     |
| ShowUI                       | ShowUI           | ShowUI           | 92.3        | 75.5        | 76.3         | 61.1         | 81.7     | 63.6     | 75.1     |
| Iris                         | Iris             | SeeClick         | 85.3        | 64.2        | 86.7         | 57.5         | 82.6     | 71.2     | 74.6     |
| Aria-UI                      | Aria             | Aria-UI          | 92.3        | 73.8        | 93.3         | 64.3         | 86.5     | 76.2     | 81.1     |
| **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | 89.4        | 72.0        | 88.7         | 65.7         | 81.3     | 68.9     | 77.7     |
| **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | 93.0        | **79.9**    | **93.8**     | **76.4**     | **90.9** | **84.0** | **86.3** |
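
The `Avg` column in the table above appears to be the unweighted mean of the six per-split scores, not a mean weighted by the number of examples in each split. The sketch below checks that reading against four rows copied from the table. The helper name `macro_average` and the `ROWS` dictionary are illustrative and are not part of the UGround codebase.

```python
# Per-split ScreenSpot scores copied from the standard-setting table.
# Order: Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
# Each entry pairs the six split scores with the reported Avg value.
ROWS = {
    "SeeClick": ([78.0, 52.0, 72.2, 30.0, 55.7, 32.5], 53.4),
    "UGround-V1": ([82.8, 60.3, 82.5, 63.6, 80.4, 70.4], 73.3),
    "UGround-V1-2B (Qwen2-VL)": ([89.4, 72.0, 88.7, 65.7, 81.3, 68.9], 77.7),
    "UGround-V1-7B (Qwen2-VL)": ([93.0, 79.9, 93.8, 76.4, 90.9, 84.0], 86.3),
}


def macro_average(scores):
    """Unweighted mean over splits (assumed definition of the Avg column)."""
    return sum(scores) / len(scores)


for name, (scores, reported) in ROWS.items():
    avg = macro_average(scores)
    print(f"{name}: computed {avg:.2f}, reported {reported}")
    # The reported value has one decimal place, so allow rounding error.
    assert abs(avg - reported) <= 0.05
```

Because the mean is unweighted, each of the six splits contributes equally to `Avg` regardless of how many examples it contains.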

### GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding Model          | Arch             | SFT data         | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg      |
| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4o  | Qwen-VL                  | Qwen-VL          |                  | 21.3        | 21.4        | 18.6         | 10.7         | 9.1      | 5.8      | 14.5     |
| GPT-4o  | SeeClick                 | Qwen-VL          | SeeClick         | 81.0        | 59.8        | 69.6         | 33.6         | 43.9     | 26.2     | 52.4     |
| GPT-4o  | Qwen-GUI                 | Qwen-VL          | GUICourse        | 67.8        | 24.5        | 53.1         | 16.4         | 50.4     | 18.5     | 38.5     |
| GPT-4o  | **UGround-V1**               | LLaVA-UGround-V1 | UGround-V1       | **93.4**        | **76.9**        | **92.8**         | **67.9**         | **88.7**     | **68.9**     | **81.4**     |
| GPT-4o  | OS-Atlas-Base-4B         | InternVL         | OS-Atlas         | **94.1**    | 73.8        | 77.8         | 47.1         | 86.5     | 65.3     | 74.1     |
| GPT-4o  | OS-Atlas-Base-7B         | Qwen2-VL         | OS-Atlas         | 93.8        | **79.9**    | 90.2         | 66.4         | **92.6** | **79.1** | 83.7     |
| GPT-4o  | **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | **94.1**    | 77.7        | 92.8         | 63.6         | 90.0     | 70.9     | 81.5     |
| GPT-4o  | **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | **94.1**    | **79.9**    | **93.3**     | **73.6**     | 89.6     | 73.3     | **84.0** |
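
For readers unfamiliar with the benchmark: ScreenSpot scores are click accuracies, where a prediction counts as correct when the predicted point falls inside the bounding box of the target element. The sketch below illustrates that metric under stated assumptions. It is not the evaluation script from the UGround repository; the function names, the `(left, top, right, bottom)` box layout, and the toy data are all hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y) in pixels
Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom)


def is_hit(point: Point, box: Box) -> bool:
    """Return True when the predicted point lies inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predictions that land inside their ground-truth box."""
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not predictions:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(predictions, boxes))
    return 100.0 * hits / len(predictions)


# Toy example (made-up coordinates): the first click is inside its box,
# the second is outside.
toy_predictions = [(120.0, 45.0), (300.0, 400.0)]
toy_boxes = [(100.0, 30.0, 180.0, 60.0), (10.0, 10.0, 50.0, 50.0)]
print(click_accuracy(toy_predictions, toy_boxes))
```

Computing this percentage separately for each split (Mobile-Text, Mobile-Icon, and so on) would yield values comparable to the per-column entries in the tables.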





![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/u5bXFxxAWCXthyXWyZkM4.png)

## Citation Information

If you find this work useful, please consider citing our papers: 

```
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```