arxiv:2410.24024

AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

Published on Oct 31 · Submitted by ShawLiu on Nov 5
#1 Paper of the day

Abstract

Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently become a frequently mentioned interaction method. However, existing studies on training and evaluating Android agents lack systematic research covering both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, an action space, and a reproducible benchmark, and it supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.

Community

My read of this paper:

๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ๐—Ÿ๐—ฎ๐—ฏ: ๐—™๐—ถ๐—ฟ๐˜€๐˜ ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ๐—ฎ๐˜๐—ถ๐—ฐ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐—ณ๐—ผ๐—ฟ ๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ ๐—บ๐—ผ๐—ฏ๐—ถ๐—น๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐˜€๐—ต๐—ผ๐˜„๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐˜€๐—บ๐—ฎ๐—น๐—น, ๐—ณ๐—ถ๐—ป๐—ฒ-๐˜๐˜‚๐—ป๐—ฒ๐—ฑ ๐—ผ๐—ฝ๐—ฒ๐—ป ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐—ฐ๐—ฎ๐—ป ๐—ฝ๐—ผ๐˜„๐—ฒ๐—ฟ ๐—ฎ ๐—๐—”๐—ฅ๐—ฉ๐—œ๐—ฆ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ ๐—ผ๐—ป ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜€๐—บ๐—ฎ๐—ฟ๐˜๐—ฝ๐—ต๐—ผ๐—ป๐—ฒ ๐Ÿ“ฑ๐Ÿ”ฅ

A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.

They show that fine-tuning small open-source models can significantly boost performance, matching that of much bigger closed models like GPT-4o.

The team built:

📊 A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically

๐Ÿ“๐Ÿ“ฑ A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces

✅ An instruction dataset of 10.5k operation traces for training mobile agents
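
To make the shared action space concrete, here is a minimal Python sketch of what such an agent loop could look like. All interface names (env.get_xml, env.get_screenshot, env.step, env.evaluate, model.decide) are hypothetical placeholders, not AndroidLab's actual API; the point is only that text-only and multimodal agents differ in what they observe, not in the actions they can emit.

```python
# Hypothetical sketch of an Android agent loop with a shared action space.
# The environment/model interfaces below are illustrative placeholders,
# NOT AndroidLab's real API.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # e.g. "tap", "swipe", "type", "finish"
    target: str = ""   # element id from the XML tree or a marked-screenshot label
    text: str = ""     # text to type, if any


def parse_action(model_output: str) -> Action:
    """Map the model's textual decision onto the shared action vocabulary."""
    kind, _, rest = model_output.strip().partition(" ")
    target, _, text = rest.partition(" ")
    return Action(kind=kind.lower(), target=target, text=text)


def run_episode(env, model, task: str, use_screenshots: bool, max_steps: int = 30):
    """Drive one benchmark task until the agent finishes or the step budget runs out."""
    for _ in range(max_steps):
        # Only the observation modality changes between the two agent types;
        # the action space stays identical, which keeps their scores comparable.
        observation = env.get_screenshot() if use_screenshots else env.get_xml()
        action = parse_action(model.decide(task=task, observation=observation))
        if action.kind == "finish":
            break
        env.step(action)
    return env.evaluate(task)  # reproducible check against the task's success criteria
```

Keeping one action vocabulary for both modalities is what lets the benchmark compare LLM-based (XML) and LMM-based (screenshot) agents on equal footing.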

Key insights:

  • 📈 Fine-tuning improves performance BY A LOT: the open-source model Llama-3.1-8B improves from a 2% to a 24% success rate after training, nearly reaching GPT-4o performance although it's much smaller
  • ⚙️ Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.

Congrats on this great work 🤗
