Aria-UI committed · Commit 8cd7246 · verified · 1 Parent(s): 1870680

Update README.md

Files changed (1): README.md (+152 -1)
README.md CHANGED
@@ -27,4 +27,155 @@ Aria-UI sets new state-of-the-art results on offline and online agent benchmarks
  🏆 **1st place** on **AndroidWorld** with **44.8%** task success rate and
  🥉 **3rd place** on **OSWorld** with **15.2%** task success rate (Dec. 2024).

- ![Performance Spider Plot](assets/performance_spider.pdf)
+ ![Performance Spider Plot](assets/performance_spider.pdf)
+
+ ## Quick Start
+ ### Installation
+ ```
+ pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
+ pip install flash-attn --no-build-isolation
+ # For better inference performance, you can install grouped-gemm, which may take 3-5 minutes to install
+ pip install grouped_gemm==0.1.6
+ ```
+
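+ To check that the pinned environment imported cleanly before downloading the model, a quick sanity check can help (a minimal sketch; the `flash_attn` import is only relevant if you installed flash-attn above):
+ ```python
+ import torch
+ import transformers
+
+ print(transformers.__version__)   # expect 4.45.0
+ print(torch.cuda.is_available())  # the bfloat16 inference below assumes a CUDA GPU
+ import flash_attn                 # raises ImportError if the flash-attn build failed
+ ```
+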
+ ### Inference with vLLM (strongly recommended)
+ First, make sure you install the latest version of vLLM, so that it supports Aria-UI:
+ ```
+ pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
+ ```
+
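+ You can verify which build is active (a minimal sketch; the nightly wheel above should report a `1.0.0.dev` version):
+ ```python
+ import vllm
+ print(vllm.__version__)
+ ```
+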
+ Here is a code snippet for Aria-UI with vLLM.
+ ```python
+ from PIL import Image, ImageDraw
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+ import ast
+
+ model_path = "Aria-UI/Aria-UI-base"
+
+ def main():
+     llm = LLM(
+         model=model_path,
+         tokenizer_mode="slow",
+         dtype="bfloat16",
+         trust_remote_code=True,
+     )
+
+     tokenizer = AutoTokenizer.from_pretrained(
+         model_path, trust_remote_code=True, use_fast=False
+     )
+
+     instruction = "Try Aria."
+
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image"},
+                 {
+                     "type": "text",
+                     "text": "Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description: " + instruction,
+                 },
+             ],
+         }
+     ]
+
+     message = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
+
+     outputs = llm.generate(
+         {
+             "prompt_token_ids": message,
+             "multi_modal_data": {
+                 "image": [
+                     Image.open("examples/aria.png"),
+                 ],
+                 "max_image_size": 980,  # [Optional] The max image patch size, default `980`
+                 "split_image": True,  # [Optional] whether to split the images, default `True`
+             },
+         },
+         sampling_params=SamplingParams(max_tokens=50, top_k=1, stop=["<|im_end|>"]),
+     )
+
+     for o in outputs:
+         generated_tokens = o.outputs[0].token_ids
+         response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
+         print(response)
+         coords = ast.literal_eval(response.replace("<|im_end|>", "").replace("```", "").replace(" ", "").strip())
+         return coords
+
+ if __name__ == "__main__":
+     main()
+ ```
+
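+ The model answers with a point in the relative 0-1000 range, so the parsed `coords` must be scaled by the screenshot size before you click or draw on it. Below is a minimal sketch, assuming `coords` is the `(x, y)` tuple returned by `main()` above (the marker size and output path are arbitrary):
+ ```python
+ from PIL import Image, ImageDraw
+
+ image = Image.open("examples/aria.png")
+ x_rel, y_rel = coords  # relative coordinates in [0, 1000]
+
+ # Scale the relative coordinates to absolute pixels.
+ x = x_rel / 1000 * image.width
+ y = y_rel / 1000 * image.height
+
+ # Mark the grounded point to inspect the prediction visually.
+ draw = ImageDraw.Draw(image)
+ draw.ellipse((x - 12, y - 12, x + 12, y + 12), outline="red", width=4)
+ image.save("examples/aria_grounded.png")
+ ```
+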
+ ### Inference with Transformers (not recommended)
+ You can also use the original `transformers` API for Aria-UI. For instance:
+ ```python
+ import os
+ import ast
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
+ model_path = "Aria-UI/Aria-UI-base"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ image_file = "./examples/aria.png"
+ instruction = "Try Aria."
+ image = Image.open(image_file).convert("RGB")
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"text": None, "type": "image"},
+             {"text": instruction, "type": "text"},
+         ],
+     }
+ ]
+
+ text = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(text=text, images=image, return_tensors="pt")
+ inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
+     output = model.generate(
+         **inputs,
+         max_new_tokens=50,
+         stop_strings=["<|im_end|>"],
+         tokenizer=processor.tokenizer,
+         # do_sample=True,
+         # temperature=0.9,
+     )
+
+ output_ids = output[0][inputs["input_ids"].shape[1]:]
+ response = processor.decode(output_ids, skip_special_tokens=True)
+ print(response)
+
+ coords = ast.literal_eval(response.replace("<|im_end|>", "").replace("```", "").replace(" ", "").strip())
+ ```
+
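+ The decoded response is parsed the same way as in the vLLM example; assuming it is again a relative 0-1000 point, the same scaling step shown above applies before acting on the screenshot.
+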
+ ## Citation
+ If you find our work helpful, please consider citing.
+ ```
+ @article{ariaui,
+     title={Aria-UI: Visual Grounding for GUI Instructions},
+     author={Yuhao Yang and Yue Wang and Dongxu Li and Ziyang Luo and Bei Chen and Chao Huang and Junnan Li},
+     year={2024},
+     journal={arXiv preprint arXiv:2412.16256},
+ }
+ ```