This repository contains the model from the paper "OS-ATLAS: A Foundation Action Model for Generalist GUI Agents".
Quick Start
OS-Atlas-Base-7B is a GUI grounding model finetuned from Qwen2-VL-7B-Instruct.
Notes: Our models accept images of any size as input. The model outputs are normalized to relative coordinates within a 0-1000 range (either a center point or a bounding box defined by top-left and bottom-right coordinates). For visualization, please remember to convert these relative coordinates back to the original image dimensions.
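The conversion back to pixel coordinates can be sketched as follows. This is a minimal illustration, not part of the released code: the helper names `box_to_pixels` and `point_to_pixels` are hypothetical, and it assumes the relative value is simply divided by 1000 and multiplied by the original image width or height.

```python
def point_to_pixels(point, width, height, scale=1000):
    """Map an (x, y) point in the 0-1000 relative range to pixel coordinates.

    Assumption: relative values scale linearly with the image size.
    """
    x, y = point
    return (round(x / scale * width), round(y / scale * height))


def box_to_pixels(box, width, height, scale=1000):
    """Map an (x1, y1, x2, y2) box (top-left, bottom-right) to pixel coordinates."""
    x1, y1, x2, y2 = box
    return point_to_pixels((x1, y1), width, height, scale) + point_to_pixels(
        (x2, y2), width, height, scale
    )


# Example: a relative box on a hypothetical 1920x1080 screenshot
print(box_to_pixels((576, 12, 592, 42), 1920, 1080))  # (1106, 13, 1137, 45)
```

The width and height should be those of the original screenshot, not of any resized copy passed to the processor.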
Inference Example
First, ensure that the necessary dependencies are installed:
pip install transformers
pip install qwen-vl-utils
Inference code example:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Atlas-Base-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/OS-Copilot/OS-Atlas/blob/main/exmaples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
            },
            {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
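To use the prediction programmatically, the bounding box has to be pulled out of the decoded string. The sketch below is one way to do that with a regular expression. It assumes the output always follows the `<|box_start|>(x1,y1),(x2,y2)<|box_end|>` layout shown in the sample above, and `parse_box` is a hypothetical helper name, not part of the model's API.

```python
import re

# Matches "<|box_start|>(x1,y1),(x2,y2)<|box_end|>" as in the sample output.
BOX_PATTERN = re.compile(
    r"<\|box_start\|>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)<\|box_end\|>"
)


def parse_box(text):
    """Return (x1, y1, x2, y2) as ints from the first box in text, or None."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


sample = (
    "<|object_ref_start|>language switch<|object_ref_end|>"
    "<|box_start|>(576,12),(592,42)<|box_end|><|im_end|>"
)
print(parse_box(sample))  # (576, 12, 592, 42)
```

Since `batch_decode` returns a list of strings, pass a single element such as `output_text[0]` to the parser. The returned values are still in the 0-1000 relative range.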
Citation
If you find this repository helpful, feel free to cite our paper:
@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}