AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Abstract
Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently become a frequently discussed interaction method. However, existing studies on training and evaluating Android agents lack systematic research across both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, an action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
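To make the "same action space" idea concrete, here is a minimal Python sketch of what a unified, modality-agnostic action representation could look like. The class and function names below are hypothetical illustrations, not AndroidLab's actual API; see the linked repository for the real interface.

```python
# A minimal, hypothetical sketch of a unified action space shared by text-only
# (XML) and multimodal (screenshot) agents. Names are illustrative only and do
# not reflect AndroidLab's actual API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    """A single device-level operation emitted by the agent."""
    kind: str                          # e.g. "tap", "swipe", "type", "back", "home"
    element_id: Optional[int] = None   # index of a UI element (XML node or screen mark)
    text: Optional[str] = None         # text payload for "type" actions


def parse_action(model_output: str) -> Action:
    """Parse a response such as 'tap(5)' or 'type(3, "hello")' into an Action."""
    name, _, args = model_output.strip().partition("(")
    args = args.rstrip(")")
    if name == "type":
        idx, _, payload = args.partition(",")
        return Action(kind="type", element_id=int(idx), text=payload.strip().strip('"'))
    if name in {"tap", "swipe"} and args:
        return Action(kind=name, element_id=int(args))
    return Action(kind=name)
```

Because both text-only and multimodal models emit actions in a shared textual format, a single parser and executor can serve either modality.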
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation (2024)
- Lightweight Neural App Control (2024)
- ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents (2024)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (2024)
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control (2024)
My read of this paper:
AndroidLab: First-ever systematic benchmark for Android mobile agents shows that small, fine-tuned open models can power a JARVIS-like system on your smartphone.
A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.
They show that fine-tuning small open-source models can significantly boost performance, approaching that of much larger closed models like GPT-4o.
The team built:
- A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically
- A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces (see the sketch after this list)
- An instruction dataset of 10.5k operation traces for training mobile agents
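As a rough illustration of how one environment might serve both interface modes, the sketch below rolls out a single task in either XML or marked-screenshot mode. The `env` and `model` objects, their methods, and `parse_action` (from the earlier sketch) are hypothetical placeholders, not AndroidLab's actual interfaces.

```python
# Hypothetical rollout loop: the observation format changes with the mode, but
# the action space (and parse_action from the earlier sketch) stays the same.
def run_episode(env, model, mode: str = "xml", max_steps: int = 25) -> bool:
    """Roll out one benchmark task and report success.

    mode="xml"        -> feed the model a compressed UI tree as plain text (LLMs)
    mode="screenshot" -> feed the model an image with numbered element marks (LMMs)
    """
    obs = env.reset()
    for _ in range(max_steps):
        if mode == "xml":
            response = model.generate(obs.instruction + "\n" + obs.xml_tree)
        else:
            response = model.generate(obs.instruction, image=obs.marked_screenshot)
        action = parse_action(response)   # same action space in both modes
        obs, done = env.step(action)
        if done:
            break
    return env.task_succeeded()
```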
Key insights:
- Fine-tuning improves performance by a lot: the open-source Llama-3.1-8B improves from a 2% to a 24% success rate after training, nearly reaching GPT-4o performance despite being much smaller
- Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.
Congrats on this great work!