Submitted: 05 January 2025
Posted: 06 January 2025
Abstract
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges: (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents. Project Homepage: github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
1. Introduction
- A Comprehensive and Systematic Survey of LLM-Powered Phone GUI Agents. We provide an in-depth and structured overview of recent literature on LLM-powered phone automation, examining its developmental trajectory, core technologies, and real-world application scenarios. By comparing LLM-driven methods to traditional phone automation approaches, this survey clarifies how large models transform GUI-based tasks and enable more intelligent, adaptive interaction paradigms.
- Methodological Framework from Multiple Perspectives. Leveraging insights from existing studies, we propose a unified methodology for designing LLM-driven phone GUI agents. This encompasses framework design (e.g., single-agent vs. multi-agent vs. plan-then-act frameworks), LLM model selection and training (prompt engineering vs. training-based methods), data collection and preparation strategies (GUI-specific datasets and annotations), and evaluation protocols (benchmarks and metrics). Our systematic taxonomy and method-oriented discussion serve as practical guidelines for both academic and industrial practitioners.
- In-Depth Analysis of Why LLMs Empower Phone Automation. We delve into the fundamental reasons behind LLMs’ capacity to enhance phone automation. By detailing their advancements in natural language comprehension, multimodal grounding, reasoning, and decision-making, we illustrate how LLMs bridge the gap between user intent and GUI actions. This analysis elucidates the critical role of large models in tackling issues of scalability, adaptability, and human-like interaction in real-world mobile environments.
- Insights into Latest Developments, Datasets, and Benchmarks. We introduce and evaluate the most recent progress in the field, highlighting innovative datasets that capture the complexity of modern GUIs and benchmarks that allow reliable performance assessment. These resources form the backbone of LLM-based phone automation, enabling systematic training, fair evaluation, and transparent comparisons across different agent designs.
- Identification of Key Challenges and Novel Perspectives for Future Research. Beyond discussing mainstream hurdles (e.g., dataset coverage, on-device constraints, reliability), we propose forward-looking viewpoints on user-centric adaptations, security and privacy considerations, long-horizon planning, and multi-agent coordination. These novel perspectives shed light on how researchers and developers might advance the current state of the art toward more robust, secure, and personalized phone GUI agents.
2. Development of Phone Automation
2.1. Phone Automation Before the LLM Era
2.1.1. Automation Testing
2.1.2. Shortcuts
2.1.3. Robotic Process Automation
2.2. Challenges of Traditional Methods
2.2.1. Limited Generality
2.2.2. High Maintenance Costs
2.2.3. Poor Intent Comprehension
2.2.4. Weak Screen GUI Perception
2.3. LLMs Boost Phone Automation
2.4. Emerging Commercial Applications
3. Frameworks and Components of Phone GUI Agents
3.1. Perception in Phone GUI Agents
3.1.1. UI Information Perception
3.1.2. Phone State Perception
3.2. Brain in Phone GUI Agents
3.2.1. Storage
- Pre-trained Knowledge. LLMs are inherently equipped with a vast amount of general knowledge, including common-sense reasoning and familiarity with programming and markup languages such as HTML. This pre-existing knowledge allows the agent to interpret UI representations and generate meaningful actions based on them.
- Domain-Specific Training. Some agents enhance their knowledge by training on phone automation-specific datasets. Works such as AutoUI Zhang and Zhang (2023), CogAgent Hong et al. (2024), ScreenAI Baechler et al. (2024), CoCo-agent Ma et al. (2024), and Ferret-UI You et al. (2024) have trained LLMs on datasets tailored for mobile UI interactions, thereby improving their capability to understand and manipulate mobile interfaces effectively. For a more detailed discussion of knowledge acquisition through model training, see § 4.2.
- Knowledge Injection. Agents can enhance their decision-making by incorporating knowledge derived from exploratory interactions and stored contextual information. This involves utilizing data collected during offline exploration phases or from observed human demonstrations to inform the LLM’s reasoning process. For instance, AutoDroid Wen et al. (2024) explores app functionalities and records UI transitions in a UI Transition Graph (UTG) memory; these transitions are then used to generate task-specific prompts for the LLM. Similarly, AppAgent Zhang et al. (2023a) compiles knowledge from autonomous interactions and human demonstrations into structured documents, enabling the LLM to make informed decisions based on comprehensive UI state information and task requirements.
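To make the mechanism concrete, the sketch below shows a minimal UI Transition Graph memory and a prompt builder in the spirit of AutoDroid's UTG; the string-keyed states, stored descriptions, and prompt format are illustrative assumptions, not the published implementation.

```python
from dataclasses import dataclass, field

@dataclass
class UTGMemory:
    """Minimal UI Transition Graph: nodes are UI states (string keys here),
    edges record which action led from one state to another."""
    transitions: dict = field(default_factory=dict)   # state -> {action: next_state}
    descriptions: dict = field(default_factory=dict)  # state -> short text summary

    def record(self, state: str, action: str, next_state: str, desc: str) -> None:
        """Store one explored transition and a summary of the resulting screen."""
        self.transitions.setdefault(state, {})[action] = next_state
        self.descriptions[next_state] = desc

    def build_prompt(self, state: str, task: str) -> str:
        """Render the locally relevant slice of the graph into a task prompt."""
        edges = self.transitions.get(state, {})
        lines = [f"- '{action}' leads to: {self.descriptions.get(nxt, 'unknown screen')}"
                 for action, nxt in edges.items()]
        memory = "\n".join(lines) or "(no explored transitions from this screen)"
        return (f"Task: {task}\n"
                f"Current screen: {self.descriptions.get(state, 'unseen screen')}\n"
                f"Known transitions:\n{memory}\n"
                "Choose the next action.")

# Populate during offline exploration; query at inference time.
memory = UTGMemory()
memory.descriptions["home"] = "launcher home screen"
memory.record("home", "tap Settings icon", "settings", "system settings list")
print(memory.build_prompt("home", "Turn on Wi-Fi"))
```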
3.2.2. Decision Making
3.3. Action in Phone GUI Agents
3.4. Multi-Agent Framework
3.4.1. Role-Coordinated Multi-Agent
3.4.2. Scenario-Based Task Execution
3.5. Plan-Then-Act Framework
4. LLMs for Phone Automation
4.1. Prompt Engineering
4.1.1. Text-Based Prompt
4.1.2. Multimodal Prompt
4.2. Training-Based Methods
4.2.1. Task-Specific LLM-based Agents
| Method | Date | Task Type | Backbone | Size | Contributions |
|---|---|---|---|---|---|
| Auto-GUI Zhang and Zhang (2023) | 2023.09 | General | N/A | 60M / 200M / 700M | Direct screen interaction; Chain-of-action; Uses action histories and future plans |
| CogAgent Hong et al. (2024) | 2023.12 | General | CogVLM | 18B | High-resolution input (1120×1120); Specialized in GUI understanding |
| WebVLN-Net Chen et al. (2024a) | 2023.12 | Screen Understanding, QA | N/A | N/A | Web navigation with visual and HTML content; WebVLN-v1 dataset |
| ScreenAI Baechler et al. (2024) | 2024.02 | Screen Understanding, QA | N/A | 4.6B | UI and infographic understanding; Flexible patching; Three datasets released |
| CoCo-Agent Ma et al. (2024) | 2024.02 | General | LLaVA (LLaMA-2-chat-7B, CLIP) | N/A | Comprehensive perception; Conditional action prediction; Enhanced automation |
| Ferret-UI You et al. (2024) | 2024.04 | Screen Understanding, Grounding | Ferret | N/A | "Any resolution" techniques; Precise referring and grounding |
| LVG Qian et al. (2024b) | 2024.06 | Screen Understanding, Grounding | SWIN Transformer, BERT | N/A | Visual UI grounding; Layout-guided contrastive learning |
| MobileFlow Nong et al. (2024) | 2024.07 | General | Qwen-VL-Chat | 21B | Hybrid visual encoders; Variable resolutions; Multilingual support |
| UI-Hawk Zhang et al. (2024d) | 2024.08 | Screen Understanding, Grounding | N/A | N/A | History-aware encoder; Screen stream processing; FunUI benchmark |
- UI Grounding involves identifying and localizing UI elements on a screen that correspond to a given natural language description. This task is critical for agents to perform precise interactions with GUIs based on user instructions. LVG (Layout-guided Visual Grounding) Qian et al. (2024b) addresses UI grounding by unifying detection and grounding of UI elements within application interfaces. LVG tackles challenges such as application sensitivity, where UI elements with similar appearances have different functions across applications, and context sensitivity, where the functionality of UI elements depends on their context within the interface. By introducing layout-guided contrastive learning, LVG learns the semantics of UI objects from their visual organization and spatial relationships, improving grounding accuracy. UI-Hawk Zhang et al. (2024d) enhances UI grounding by incorporating a history-aware visual encoder and an efficient resampler to process screen sequences during GUI navigation. By understanding historical screens, UI-Hawk improves the agent’s ability to ground UI elements accurately over time. An automated data curation method generates training data for UI grounding, contributing to the creation of the FunUI benchmark for evaluating screen understanding capabilities.
- UI Referring focuses on generating natural language descriptions for specified UI elements on a screen. This task enables agents to explain UI components to users or other agents, facilitating better communication and interaction. Ferret-UI You et al. (2024) is a multimodal LLM designed for enhanced understanding of mobile UI screens, emphasizing precise referring and grounding tasks. By incorporating any resolution techniques to handle various screen aspect ratios and dividing screens into sub-images for detailed analysis, Ferret-UI generates accurate descriptions of UI elements. Training on a curated dataset of elementary UI tasks, Ferret-UI demonstrates strong performance in UI referring tasks. UI-Hawk Zhang et al. (2024d) also contributes to UI referring by defining tasks that require the agent to generate descriptions for UI elements based on their role and context within the interface. By processing screen sequences and understanding the temporal relationships between screens, UI-Hawk improves the agent’s ability to refer to UI elements accurately.
- Screen Question Answering involves answering questions about the content and functionality of a screen based on visual and textual information. This task requires agents to comprehend complex screen layouts and extract relevant information to provide accurate answers. ScreenAI Baechler et al. (2024) specializes in understanding screen UIs and infographics, leveraging the common visual language and design principles shared between them. By introducing a flexible patching strategy and a novel textual representation for UIs, ScreenAI pre-trains models to interpret UI elements effectively. Using large language models to automatically generate training data, ScreenAI covers tasks such as screen annotation and screen QA. WebVLN Chen et al. (2024a) extends vision-and-language navigation to websites, where agents navigate based on question-based instructions and answer questions using information extracted from target web pages. By integrating visual inputs, linguistic instructions, and web-specific content like HTML, WebVLN enables agents to understand both the visual layout and underlying structure of web pages, enhancing screen QA capabilities. UI-Hawk Zhang et al. (2024d) further enhances screen QA by enabling agents to process screen sequences and answer questions based on historical interactions. By incorporating screen question answering as one of its fundamental tasks, UI-Hawk improves the agent’s ability to comprehend and reason about screen content over time.
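These three tasks differ chiefly in the shape of the query: grounding maps a description to a region, referring maps a region to a description, and screen QA maps a question plus screen context to an answer. The sketch below illustrates the three query formats; the JSON reply conventions are our assumptions rather than any listed model's API.

```python
def grounding_prompt(description: str) -> str:
    """Description -> region: ask for a bounding box in pixel coordinates."""
    return (f'Locate the UI element matching: "{description}". '
            'Reply with JSON {"box": [x1, y1, x2, y2]}.')

def referring_prompt(box: tuple[int, int, int, int]) -> str:
    """Region -> description: ask what the boxed element is and does."""
    x1, y1, x2, y2 = box
    return (f"Describe the UI element in region ({x1},{y1})-({x2},{y2}): "
            "its type, visible text or icon, and the effect of tapping it.")

def screen_qa_prompt(question: str, ocr_text: list[str]) -> str:
    """Question + screen context -> answer, citing on-screen evidence."""
    context = "\n".join(f"[{i}] {t}" for i, t in enumerate(ocr_text))
    return (f"On-screen text:\n{context}\n\nQuestion: {question}\n"
            'Reply with JSON {"answer": ..., "evidence_ids": [...]}.')

# Each prompt is sent alongside the screenshot to a multimodal model.
# A grounding reply is typically converted to a tap point at the box center:
def box_center(box: tuple[int, int, int, int]) -> tuple[int, int]:
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2
```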
4.2.2. Supervised Fine-Tuning
4.2.3. Reinforcement Learning
| Method | Date | Platform | RL Type | Backbone | Size |
|---|---|---|---|---|---|
| DigiRL Bai et al. (2024) | 2024.06 | Phone | Online RL | AutoUI-Base | 200M |
| DistRL Wang et al. (2024g) | 2024.10 | Phone | Online RL | T5-based | 1.3B |
| AutoGLM Liu et al. (2024b) | 2024.11 | Phone, Web | Online RL | GLM-4-9B-Base | 9B |
| ScreenAgent Niu et al. (2024) | 2024.02 | PC OS | N/A | CogAgent | 18B |
| ETO Song et al. (2024a) | 2024.03 | Web | Offline-to-Online RL | LLaMA-2-7B-Chat | 7B |
| AutoWebGLM Lai et al. (2024) | 2024.04 | Web | RL (Curriculum Learning, Bootstrapped RL) | ChatGLM3-6B | 6B |
| Agent Q Putta et al. (2024) | 2024.08 | Web | Offline RL with MCTS | LLaMA-3-70B | 70B |
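A common core underlies these systems: roll out the agent on-device, score trajectories with a task-level reward, and update the policy from the collected log-probabilities. The sketch below shows a minimal REINFORCE-style update under that framing; the trajectory interface is an assumption, and production systems such as DigiRL and DistRL add substantially more machinery (value estimation, asynchronous collection).

```python
import torch

def reinforce_update(optimizer, trajectories, gamma: float = 0.99) -> float:
    """One REINFORCE step over collected device-control episodes.
    Each trajectory is a list of (log_prob, reward) pairs, where log_prob is
    the policy's log-probability (a torch scalar) of the executed action and
    reward is typically sparse: 1.0 at the final step iff the task succeeded."""
    loss = torch.tensor(0.0)
    for traj in trajectories:
        ret = 0.0
        # Accumulate discounted returns backwards through the episode.
        for log_prob, reward in reversed(traj):
            ret = reward + gamma * ret
            loss = loss - log_prob * ret
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```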
5. Datasets and Benchmarks
5.1. Datasets
5.2. Benchmarks
5.2.1. Evaluation Pipelines
5.2.2. Evaluation Metrics
6. Challenges and Future Directions
7. Conclusion
References
- Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Simone Agostinelli, Marco Lupia, Andrea Marrella, and Massimo Mecella. 2022. Reactive synthesis of software robots in rpa from user interface logs. Computers in Industry, 142:103721. [CrossRef]
- Simone Agostinelli, Andrea Marrella, and Massimo Mecella. 2019. Research challenges for intelligent robotic process automation. In Business Process Management Workshops: BPM 2019 International Workshops, Vienna, Austria, September 1–6, 2019, Revised Selected Papers 17, pages 12–18. Springer.
- Stefano V Albrecht and Peter Stone. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95. [CrossRef]
- Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Salvatore De Carmine, and Atif M Memon. 2012. Using gui ripping for automated testing of android applications. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pages 258–261.
- Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Bryan Dzung Ta, and Atif M Memon. 2014. Mobiguitar: Automated model-based testing of mobile apps. IEEE software, 32(5):53–59. [CrossRef]
- Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI magazine, 35(4):105–120. [CrossRef]
- Darko Anicic, Paul Fodor, Sebastian Rudolph, Roland Stühmer, Nenad Stojanovic, and Rudi Studer. 2010. A rule-based language for complex event processing and reasoning. In Web Reasoning and Rule Systems: Fourth International Conference, RR 2010, Bressanone/Brixen, Italy, September 22-24, 2010. Proceedings 4, pages 42–57. Springer.
- GEM Anscombe. 2000. Intention.
- Yauhen Leanidavich Arnatovich and Lipo Wang. 2018. A systematic literature review of automated techniques for functional gui testing of mobile applications. arXiv preprint arXiv:1812.11470.
- Muhammad Asadullah and Ahsan Raza. 2016. An overview of home automation systems. In 2016 2nd international conference on robotics and artificial intelligence (ICRAI), pages 27–31. IEEE.
- Tanzirul Azim and Iulian Neamtiu. 2013. Targeted and depth-first exploration for systematic testing of android apps. In Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications, pages 641–660.
- Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. Screenai: A vision-language model for ui and infographics understanding. arXiv preprint arXiv:2402.04615.
- Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, et al. 2021. Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731.
- Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. 2024. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. arXiv preprint arXiv:2406.11896.
- Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
- Ishan Banerjee, Bao Nguyen, Vahid Garousi, and Atif Memon. 2013. Graphical user interface (gui) testing: Systematic mapping and repository. Information and Software Technology, 55(10):1679–1694. [CrossRef]
- Daniil A Boiko, Robert MacKnight, and Gabe Gomes. 2023. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332.
- Julia Brich, Marcel Walch, Michael Rietzler, Michael Weber, and Florian Schaub. 2017. Exploring end user programming needs in home automation. ACM Transactions on Computer-Human Interaction (TOCHI), 24(2):1–35. [CrossRef]
- Robert Bridle and Eric McCreath. 2006. Inducing shortcuts on a mobile phone interface. In Proceedings of the 11th international conference on Intelligent user interfaces, pages 327–329.
- Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2021. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. arXiv preprint arXiv:2104.08560.
- Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, and Hongsheng Li. 2024. Amex: Android multi-annotation expo dataset for mobile gui agents. arXiv preprint arXiv:2407.17490.
- Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
- Chunyang Chen, Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu. 2018. From ui design image to gui skeleton: a neural machine translator to bootstrap mobile gui implementation. In Proceedings of the 40th International Conference on Software Engineering, pages 665–676.
- Fei Chen, Wei Ren, et al. 2019. On the control of multi-agent systems: A survey. Foundations and Trends® in Systems and Control, 6(4):339–499. [CrossRef]
- Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, and Qi Wu. 2024a. Webvln: Vision-and-language navigation on websites. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 1165–1173.
- Wei Chen and Zhiyuan Li. 2024. Octopus v2: On-device language model for super agent. arXiv preprint arXiv:2404.01744.
- Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588.
- Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. 2024b. Guicourse: From general vision language models to versatile gui agents. arXiv preprint arXiv:2406.11317.
- Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024c. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821. [CrossRef]
- Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024d. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198.
- Zhiyuan Chen, Yaning Li, and Kairui Wang. 2024e. Optimizing reasoning abilities in large language models: A step-by-step approach. Authorea Preprints.
- Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935.
- Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
- Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, and Kun Shao. 2024. Lightweight neural app control. arXiv preprint arXiv:2410.17883.
- Janine Clarke, Judith Proudfoot, Alexis Whitton, Mary-Rose Birch, Megan Boyd, Gordon Parker, Vijaya Manicavasagar, Dusan Hadzi-Pavlovic, Andrea Fogarty, et al. 2016. Therapeutic alliance with a fully automated mobile phone and web-based intervention: secondary analysis of a randomized controlled trial. JMIR mental health, 3(1):e4656. [CrossRef]
- Benjamin R Cowan, Nadia Pantidi, David Coyle, Kellie Morrissey, Peter Clarke, Sara Al-Shehri, David Earley, and Natasha Bandeira. 2017. "What can I help you with?": Infrequent users’ experiences of intelligent personal assistants. In Proceedings of the 19th international conference on human-computer interaction with mobile devices and services, pages 1–12.
- Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. 2023. Collaborating with language models for embodied reasoning. arXiv preprint arXiv:2302.00763.
- Christian Degott, Nataniel P Borges Jr, and Andreas Zeller. 2019. Learning user interface element interactions. In Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis, pages 296–306.
- Daniel C Dennett. 1988. Précis of the intentional stance. Behavioral and brain sciences, 11(3):495–505.
- Parth S Deshmukh, Saroj S Date, Parikshit N Mahalle, and Janki Barot. 2023. Automated gui testing for enhancing user experience (ux): A survey of the state of the art. In International Conference on ICT for Sustainable Development, pages 619–628. Springer.
- Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Tinghe Ding. 2024. Mobileagent: enhancing mobile control via human-machine interaction and sop integration. arXiv preprint arXiv:2401.04124.
- Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2024. Self-collaboration code generation via chatgpt. ACM Transactions on Software Engineering and Methodology, 33(7):1–38.
- Ali Dorri, Salil S Kanhere, and Raja Jurdak. 2018. Multi-agent systems: A survey. Ieee Access, 6:28573–28593. [CrossRef]
- Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. 2020. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941.
- Jianqing Fan, Zhaoran Wang, Yuchen Xie, and Zhuoran Yang. 2020. A theoretical analysis of deep q-learning. In Learning for dynamics and control, pages 486–489. PMLR.
- Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, and Nanning Zheng. 2024. Understanding mobile gui: From pixel-words to screen-sentences. Neurocomputing, 601:128200. [CrossRef]
- Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah Goodman. 2024. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36.
- Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational ai. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 1371–1374.
- Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2024. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243.
- Tiago Guerreiro, Ricardo Gamboa, and Joaquim Jorge. 2008. Mnemonical body shortcuts: improving mobile interaction. In Proceedings of the 15th European conference on Cognitive ergonomics: the ergonomics of cool interaction, pages 1–8.
- Unmesh Gundecha. 2015. Selenium Testing Tools Cookbook. Packt Publishing Ltd.
- Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680.
- Thilo Hagendorff. 2023. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv preprint arXiv:2303.13988, 1.
- Sabrina Haque and Christoph Csallner. 2024. Infering alt-text for ui icons with large language models during app development. arXiv preprint arXiv:2409.18060.
- Geoffrey Hecht, Omar Benomar, Romain Rouvoy, Naouel Moha, and Laurence Duchien. 2015. Tracking the software quality of android applications along their evolution (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 236–247. IEEE.
- Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. 2023. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352.
- Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.
- Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoshchuk, and Artur Janicki. 2024. Clickagent: Enhancing ui location capabilities of autonomous agents. arXiv preprint arXiv:2410.11872.
- Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, and Yangfan Zhou. 2024. Auitestagent: Automatic requirements oriented gui function testing. arXiv preprint arXiv:2407.09018.
- Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of llm agents: A survey. arXiv preprint arXiv:2402.02716.
- Becky Inkster, Shubhankar Sarda, Vinod Subramanian, et al. 2018. An empathy-driven, conversational artificial intelligence agent (wysa) for digital mental well-being: real-world data evaluation mixed-methods study. JMIR mHealth and uHealth, 6(11):e12106. [CrossRef]
- Casper S Jensen, Mukul R Prasad, and Anders Møller. 2013. Automated testing with targeted event sequence generation. In Proceedings of the 2013 International Symposium on Software Testing and Analysis, pages 67–77.
- Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. 2024. From llms to llm-based agents for software engineering: A survey of current, challenges and future. arXiv preprint arXiv:2408.02479.
- Ning Kang, Bharat Singh, Zubair Afzal, Erik M van Mulligen, and Jan A Kors. 2013. Using rule-based natural language processing to improve disease normalization in biomedical text. Journal of the American Medical Informatics Association, 20(5):876–881. [CrossRef]
- Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- Nikitas Karanikolas, Eirini Manga, Nikoletta Samaridi, Eleni Tousidou, and Michael Vassilakopoulos. 2023. Large language models versus natural language understanding and generation. In Proceedings of the 27th Pan-Hellenic Conference on Progress in Computing and Informatics, pages 278–290.
- Courtney Kennedy and Stephen E Everett. 2011. Use of cognitive shortcuts in landline and cell phone surveys. Public Opinion Quarterly, 75(2):336–348.
- Veton Kepuska and Gamal Bohouta. 2018. Next-generation of virtual personal assistants (microsoft cortana, apple siri, amazon alexa and google home). In 2018 IEEE 8th annual computing and communication workshop and conference (CCWC), pages 99–103. IEEE.
- B Kirubakaran and V Karthikeyani. 2013. Mobile application testing—challenges and solution approach through automation. In 2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, pages 79–84. IEEE.
- Ravi Kishore Kodali and Kopulwar Shishir Mahesh. 2017. Low cost implementation of smart home automation. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 461–466. IEEE.
- Ravi Kishore Kodali, Sasweth C Rajanarayanan, Lakshmi Boppana, Samradh Sharma, and Ankit Kumar. 2019. Low cost smart home automation system using smart phone. In 2019 IEEE R10 humanitarian technology conference (R10-HTC)(47129), pages 120–125. IEEE.
- Jürgen Köhl, Rogier Kolnaar, and Willem J Ravensberg. 2019. Mode of action of microbial biological control agents against plant diseases: relevance beyond efficacy. Frontiers in plant science, 10:845. [CrossRef]
- Ryuto Koike, Masahiro Kaneko, and Naoaki Okazaki. 2024. Outfox: Llm-generated essay detection through in-context learning with adversarially generated examples. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 21258–21266. [CrossRef]
- Pingfan Kong, Li Li, Jun Gao, Kui Liu, Tegawendé F Bissyandé, and Jacques Klein. 2018. Automated testing of android apps: A systematic literature review. IEEE Transactions on Reliability, 68(1):45–66. [CrossRef]
- Yavuz Koroglu, Alper Sen, Ozlem Muslu, Yunus Mete, Ceyda Ulker, Tolga Tanriverdi, and Yunus Donmez. 2018. Qbe: Qlearning-based exploration of android applications. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pages 105–115. IEEE.
- Pawel Ladosz, Lilian Weng, Minwoo Kim, and Hyondong Oh. 2022. Exploration in deep reinforcement learning: A survey. Information Fusion, 85:1–22. [CrossRef]
- Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, et al. 2024. Autowebglm: A large language model-based web navigating agent. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5295–5306.
- Chris Lamberton, Damiano Brigo, and Dave Hoy. 2017. Impact of robotics, rpa and ai on the insurance industry: challenges and opportunities. Journal of Financial Perspectives, 4(1).
- Huy Viet Le, Sven Mayer, Maximilian Weiß, Jonas Vogelsang, Henrike Weingärtner, and Niels Henze. 2020. Shortcut gestures for mobile text editing on fully touch sensitive smartphones. ACM Transactions on Computer-Human Interaction (TOCHI), 27(5):1–38. [CrossRef]
- Juyong Lee, Taywon Min, Minyong An, Dongyoon Hahm, Haeone Lee, Changyeon Kim, and Kimin Lee. 2024. Benchmarking mobile device control agents across diverse configurations. arXiv preprint arXiv:2404.16660.
- Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steven Y Ko, Sangeun Oh, and Insik Shin. 2023. Explore, select, derive, and recall: Augmenting llm with human-like memory for mobile task automation. arXiv preprint arXiv:2312.03003.
- Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023a. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008.
- Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023b. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR.
- Toby Jia-Jun Li, Amos Azaria, and Brad A Myers. 2017. Sugilite: creating multimodal smartphone automation by demonstration. In Proceedings of the 2017 CHI conference on human factors in computing systems, pages 6038–6049.
- Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. 2024a. On the effects of data scale on computer control agents. arXiv preprint arXiv:2406.03679.
- Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. 2024b. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth, 1(1):9. [CrossRef]
- Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776.
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. 2024c. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459.
- Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2019. Humanoid: A deep learning-based approach to automated black-box android app testing. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1070–1073. IEEE.
- Mario Linares-Vásquez, Kevin Moran, and Denys Poshyvanyk. 2017. Continuous, evolutionary and large-scale: A new perspective for automated mobile app testing. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 399–410. IEEE.
- Xufeng Ling, Ming Gao, and Dong Wang. 2020. Intelligent document processing based on rpa and machine learning. In 2020 Chinese Automation Congress (CAC), pages 1349–1353. IEEE.
- Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024a. Visual instruction tuning. Advances in neural information processing systems, 36.
- Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. 2024b. Autoglm: Autonomous foundation agents for guis. arXiv preprint arXiv:2411.00820.
- Xiaoyi Liu, Yingtian Shi, Chun Yu, Cheng Gao, Tianao Yang, Chen Liang, and Yuanchun Shi. 2023. Understanding in-situ programming for smart home automation. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(2):1–31. [CrossRef]
- Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, and Qing Wang. 2024c. Vision-driven automated mobile gui testing via multimodal large language model. arXiv preprint arXiv:2407.03037.
- Gio Lodi. 2021. Xctest introduction. In Test-Driven Development in Swift: Compile Better Code with XCTest and TDD, pages 13–25. Springer.
- Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, and Ping Luo. 2024a. Gui odyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. arXiv preprint arXiv:2406.08451.
- Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. 2024b. Omniparser for pure vision based gui agent. arXiv preprint arXiv:2408.00203.
- Ewa Luger and Abigail Sellen. 2016. "Like having a really bad PA": The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 5286–5297.
- Fan-Ming Luo, Tian Xu, Hang Lai, Xiong-Hui Chen, Weinan Zhang, and Yang Yu. 2024. A survey on model-based reinforcement learning. Science China Information Sciences, 67(2):121101. [CrossRef]
- Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. 2024. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation. In Findings of the Association for Computational Linguistics ACL 2024, pages 9097–9110.
- Aravind Machiry, Rohan Tahiliani, and Mayur Naik. 2013. Dynodroid: An input generation system for android apps. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 224–234.
- Rizwan Majeed, Nurul Azma Abdullah, Imran Ashraf, Yousaf Bin Zikria, Muhammad Faheem Mushtaq, and Muhammad Umer. 2020. An intelligent, secure, and smart home automation system. Scientific Programming, 2020(1):4579291.
- Indrani Medhi, Kentaro Toyama, Anirudha Joshi, Uday Athavankar, and Edward Cutrell. 2013. A comparison of list vs. hierarchical uis on mobile phones for non-literate users. In Human-Computer Interaction–INTERACT 2013: 14th IFIP TC 13 International Conference, Cape Town, South Africa, September 2-6, 2013, Proceedings, Part II 14, pages 497–504. Springer.
- Anja Meironke and Stephan Kuehnel. 2022. How to measure rpa’s benefits? a review on metrics, indicators, and evaluation methods of rpa benefit assessment.
- Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196.
- George E Monahan. 1982. State of the art—a survey of partially observable markov decision processes: theory, models, and algorithms. Management science, 28(1):1–16.
- Joel Ruben Antony Moniz, Soundarya Krishnan, Melis Ozyildirim, Prathamesh Saraf, Halim Cagri Ates, Yuan Zhang, Hong Yu, and Nidhi Rajshree. 2024. Realm: Reference resolution as language modeling. arXiv preprint arXiv:2403.20329.
- Sílvia Moreira, Henrique S Mamede, and Arnaldo Santos. 2023. Process automation using rpa–a literature review. Procedia Computer Science, 219:244–254.
- Michel Nass. 2024. On overcoming challenges with GUI-based test automation. Ph.D. thesis, Blekinge Tekniska Högskola.
- Michel Nass, Emil Alégroth, and Robert Feldt. 2021. Why many challenges with gui test automation (will) remain. Information and Software Technology, 138:106625. [CrossRef]
- Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, and Qi Wang. 2024. Screenagent: A vision language model-driven computer control agent. arXiv preprint arXiv:2402.07945.
- Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, and Wenhao Xu. 2024. Mobileflow: A multimodal llm for mobile gui agent. arXiv preprint arXiv:2407.04346.
- Minxue Pan, An Huang, Guoxin Wang, Tian Zhang, and Xuandong Li. 2020. Reinforcement learning based curiosity-driven testing of android applications. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 153–164.
- Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22.
- Syeh Mujeeb Patel and Syed Jilani Pasha. 2015. Home automation system (has) using android for mobile phone. International Journal of Scientific Engineering and Technology Research, ISSN 2319-8885.
- Neha Patil, Dhananjay Bhole, and Prasanna Shete. 2016. Enhanced ui automator viewer with improved android accessibility evaluation features. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), pages 977–983. IEEE.
- Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, and Jakub Hoscilowicz. 2024. Tinyclick: Single-turn agent for empowering gui automation. arXiv preprint arXiv:2410.11871.
- Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. 2024. Reasoning with large language models, a survey. arXiv preprint arXiv:2407.11511.
- David L Poole and Alan K Mackworth. 2010. Artificial Intelligence: foundations of computational agents. Cambridge University Press.
- Dhanya Pramod. 2022. Robotic process automation for industry: adoption status, benefits, challenges and research agenda. Benchmarking: an international journal, 29(5):1562–1586. [CrossRef]
- Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. 2024. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199.
- Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 6(3).
- Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024a. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186.
- Yijun Qian, Yujie Lu, Alexander G Hauptmann, and Oriana Riva. 2024b. Visual grounding for user interfaces. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 97–107.
- Alec Radford. 2018a. Improving language understanding by generative pre-training.
- Alec Radford. 2018b. Improving language understanding by generative pre-training.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Okko J Räsänen and Jukka P Saarinen. 2015. Sequence prediction with sparse distributed hyperdimensional coding applied to the analysis of mobile phone use patterns. IEEE transactions on neural networks and learning systems, 27(9):1878–1889. [CrossRef]
- Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024a. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.
- Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2024b. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36.
- Alberto Monge Roffarello, Aditya Kumar Purohit, and Satyam V Purohit. 2024. Trigger-action programming for wellbeing: Insights from 6590 ios shortcuts. IEEE Pervasive Computing. [CrossRef]
- Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
- Yoav Shoham. 1993. Agent-oriented programming. Artificial intelligence, 60(1):51–92. [CrossRef]
- Chloe Sinclair. The role of selenium in mobile application testing.
- Shiwangi Singh, Rucha Gadgil, and Ayushi Chudgor. 2014. Automated testing of mobile applications using scripting technique: A study on appium. International Journal of Current Engineering and Technology (IJCET), 4(5):3627–3630.
- Paloma Sodhi, SRK Branavan, Yoav Artzi, and Ryan McDonald. 2024. Step: Stacked llm policies for web actions. In First Conference on Language Modeling.
- Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. 2023a. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009.
- Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. 2024a. Trial and error: Exploration-based trajectory optimization for llm agents. arXiv preprint arXiv:2403.02502.
- Yunpeng Song, Yiheng Bian, Yongtao Tang, and Zhongmin Cai. 2023b. Navigating interfaces with ai for enhanced user interaction. arXiv preprint arXiv:2312.11190.
- Yunpeng Song, Yiheng Bian, Yongtao Tang, Guiyu Ma, and Zhongmin Cai. 2024b. Visiontasker: Mobile task automation using vision based ui understanding and llm task planning. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–17.
- Zirui Song, Yaohang Li, Meng Fang, Zhenhao Chen, Zecheng Shi, and Yuan Huang. 2024c. Mmac-copilot: Multi-modal agent collaboration operating system copilot. arXiv preprint arXiv:2404.18074.
- Matthijs TJ Spaan. 2012. Partially observable markov decision processes. In Reinforcement learning: State-of-the-art, pages 387–414. Springer.
- Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. 2022. Meta-gui: Towards multi-modal conversational agents on mobile gui. arXiv preprint arXiv:2205.11029.
- Rehan Syed, Suriadi Suriadi, Michael Adams, Wasana Bandara, Sander JJ Leemans, Chun Ouyang, Arthur HM Ter Hofstede, Inge Van De Weerd, Moe Thandar Wynn, and Hajo A Reijers. 2020. Robotic process automation: contemporary themes and challenges. Computers in Industry, 115:103162.
- Maryam Taeb, Amanda Swearngin, Eldon Schoop, Ruijia Cheng, Yue Jiang, and Jeffrey Nichols. 2024. Axnav: Replaying accessibility tests from natural language. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–16.
- Wrick Talukdar and Anjanava Biswas. 2024. Improving large language model (llm) fidelity through context-aware grounding: A systematic approach to reliability and veracity. arXiv preprint arXiv:2408.04023. [CrossRef]
- Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. In NeurIPS 2024 Workshop on Open-World Agents.
- Alejandro Torreno, Eva Onaindia, Antonín Komenda, and Michal Štolba. 2017. Cooperative multi-agent planning: A survey. ACM Computing Surveys (CSUR), 50(6):1–32.
- Porfirio Tramontana, Domenico Amalfitano, Nicola Amatucci, and Anna Rita Fasolino. 2019. Automated functional testing of mobile applications: a systematic mapping study. Software Quality Journal, 27:149–201.
- Alok Mani Tripathi. 2018. Learning Robotic Process Automation: Create Software robots and automate business processes with the leading RPA tool–UiPath. Packt Publishing Ltd.
- Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems, 36:75993–76005.
- Rejin Varghese and M Sambath. 2024. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), pages 1–6. IEEE.
- A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems.
- Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan. 2022. Ugif: Ui grounded instruction following. arXiv preprint arXiv:2211.07615.
- Boshi Wang, Xiang Yue, and Huan Sun. 2023a. Can chatgpt defend its belief in truth? evaluating llm reasoning via debate. arXiv preprint arXiv:2305.13160.
- Bryan Wang, Gang Li, and Yang Li. 2023b. Enabling conversational interaction with mobile ui using large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–17.
- Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023c. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024a. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. arXiv preprint arXiv:2406.01014.
- Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024b. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158.
- Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024c. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345. [CrossRef]
- Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, and Shoufa Chen. 2024d. Mobileagentbench: An efficient and user-friendly benchmark for mobile llm agents. arXiv preprint arXiv:2406.08184.
- Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024e. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
- Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, and Ruiming Tang. 2024f. Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890.
- Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, and Kun Shao. 2024g. Distrl: An asynchronous distributed reinforcement learning framework for on-device control agents. arXiv preprint arXiv:2410.14803.
- Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023d. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. Autodroid: Llm-powered task automation in android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 543–557.
- Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2023. Droidbot-gpt: Gpt-powered ui automation for android. arXiv preprint arXiv:2304.07061.
- Biao Wu, Yanda Li, Meng Fang, Zirui Song, Zhiwei Zhang, Yunchao Wei, and Ling Chen. 2024. Foundations and recent trends in multimodal mobile agents: A survey. arXiv preprint arXiv:2411.02006.
- Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864. [CrossRef]
- Yuchen Xia, Manthan Shenoy, Nasser Jazdi, and Michael Weyrich. 2023. Towards autonomous system: flexible modular production system enhanced with large language model agents. In 2023 IEEE 28th International Conference on Emerging Technologies and Factory Automation (ETFA), pages 1–8. IEEE.
- Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, and Zhen Xiao. 2024. Understanding the weakness of large language model agents within a complex android environment. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6061–6072.
- Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, and Yuxiao Dong. 2024. Androidlab: Training and systematic benchmarking of android autonomous agents. arXiv preprint arXiv:2410.24024.
- An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562.
- Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.
- Yulong Yang, Xinshan Yang, Shuaidong Li, Chenhao Lin, Zhengyu Zhao, Chao Shen, and Tianwei Zhang. 2024. Security matrix for multimodal agents on mobile devices: A systematic and proof of concept study. arXiv preprint arXiv:2407.09295.
- Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
- Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2024. Ferret-ui: Grounded mobile ui understanding with multimodal llms. arXiv preprint arXiv:2404.05719.
- Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, et al. 2024. Advancing llm reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078.
- Samer Zein, Norsaremah Salleh, and John Grundy. 2016. A systematic mapping study of mobile application testing techniques. Journal of Systems and Software, 117:334–356. [CrossRef]
- Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, et al. 2024a. Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279.
- Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023a. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.
- Danyang Zhang, Lu Chen, and Kai Yu. 2023b. Mobile-env: A universal platform for training and evaluation of mobile interaction. arXiv preprint arXiv:2305.08144.
- Jiayi Zhang, Chuang Zhao, Yihan Zhao, Zhaoyang Yu, Ming He, and Jianping Fan. 2024b. Mobileexperts: A dynamic tool-enabled agent team in mobile devices. arXiv preprint arXiv:2407.03913.
- Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024c. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713.
- Jiwen Zhang, Yaqi Yu, Minghui Liao, Wentao Li, Jihao Wu, and Zhongyu Wei. 2024d. Ui-hawk: Unleashing the screen stream understanding for gui agents.
- Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, and Mengwei Xu. 2024e. Llamatouch: A faithful and scalable testbed for mobile ui automation task evaluation. arXiv preprint arXiv:2404.16054.
- Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbe Ma, Muyun Yang, Tiejun Zhao, and Min Zhang. 2024f. Dynamic planning for llm-based graphical user interface automation. arXiv preprint arXiv:2410.00467.
- Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
- Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
- Yu Zhao, Brent Harrison, and Tingting Yu. 2024. Dinodroid: Testing android apps using deep q-networks. ACM Transactions on Software Engineering and Methodology, 33(5):1–24.
- Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.
| Action Type | Description |
|---|---|
| Touch Interactions | Tap: select a specific UI element. Double Tap: quickly tap twice to trigger an action. Long Press: hold a touch for extended interaction, triggering contextual options or menus. |
| Gesture-Based Actions | Swipe: move a finger in a direction (left, right, up, down). Pinch: zoom in/out by bringing fingers together/apart. Drag: move UI elements to a new location. |
| Typing and Input | Type Text: enter text into input fields. Select Text: highlight text for editing or copying. |
| System Operations | Launch Application: open a specific app. Change Settings: modify system settings (e.g., Wi-Fi, brightness). Navigate Menus: access app sections or system menus. |
| Media Control | Play/Pause: control media playback. Adjust Volume: increase or decrease device volume. |
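Executing any of these action types ultimately means synthesizing device input events. As one possible backend, the sketch below maps a few action types onto standard `adb shell input` commands; the `Action` schema is illustrative, not a standard interface.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                 # "tap" | "long_press" | "swipe" | "type"
    x: int = 0
    y: int = 0
    x2: int = 0
    y2: int = 0
    text: str = ""

def execute(a: Action) -> None:
    """Translate an abstract GUI action into adb input events."""
    if a.kind == "tap":
        cmd = ["adb", "shell", "input", "tap", str(a.x), str(a.y)]
    elif a.kind == "long_press":
        # A zero-length swipe with a long duration acts as a long press.
        cmd = ["adb", "shell", "input", "swipe",
               str(a.x), str(a.y), str(a.x), str(a.y), "800"]
    elif a.kind == "swipe":
        cmd = ["adb", "shell", "input", "swipe",
               str(a.x), str(a.y), str(a.x2), str(a.y2), "300"]
    elif a.kind == "type":
        # adb's `input text` requires spaces to be escaped as %s.
        cmd = ["adb", "shell", "input", "text", a.text.replace(" ", "%s")]
    else:
        raise ValueError(f"unsupported action: {a.kind}")
    subprocess.run(cmd, check=True)

execute(Action(kind="tap", x=540, y=960))  # requires a connected device/emulator
```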
The agents below differ in the perception inputs they consume (raw screenshots, Set-of-Mark overlays, UI trees, and icon/OCR detections) and in whether they ground actions by element index or by screen coordinates.

| Method | Date | Task Type | Model | Grounding |
|---|---|---|---|---|
| DroidBot-GPT Wen et al. (2023) | 2023.04 | General | ChatGPT | Index |
| Enabling Conversational Wang et al. (2023b) | 2023.04 | Screen Understanding, QA | PaLM | Index |
| AutoDroid Wen et al. (2024) | 2023.09 | General | GPT-4, GPT-3.5 | Index |
| MM-Navigator Yan et al. (2023) | 2023.11 | General | GPT-4V | Index |
| AppAgent Zhang et al. (2023a) | 2023.12 | General | GPT-4 | Index |
| VisionTasker Song et al. (2024b) | 2023.12 | Manual Teaching | GPT-4 | Index |
| MobileGPT Lee et al. (2023) | 2023.12 | General | GPT-3.5, GPT-4 | Index |
| Mobile-Agent Wang et al. (2024b) | 2024.01 | General | GPT-4V | Coordinate |
| AXNav Taeb et al. (2024) | 2024.05 | Bug Testing | GPT-4 | Index |
| Mobile-Agent-v2 Wang et al. (2024a) | 2024.06 | General | GPT-4V | Coordinate |
| MobileExperts Zhang et al. (2024b) | 2024.07 | General | GPT-4V | Coordinate |
| VisionDroid Liu et al. (2024c) | 2024.07 | Non-Crash Functional Bug Detection | GPT-4 | Index |
| OmniParser Lu et al. (2024b) | 2024.08 | General | GPT-4V | Index |
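Most of these agents ground actions by element index: candidate elements are enumerated (often as Set-of-Mark overlays) and the model answers with a tag that the agent resolves back to coordinates. A minimal sketch of that prompt pattern, assuming an OpenAI-compatible multimodal endpoint and a placeholder model name:

```python
import base64
from openai import OpenAI  # assumes an OpenAI-compatible multimodal endpoint

client = OpenAI()

def som_prompt(task: str, elements: list[dict]) -> str:
    """Index-grounded (Set-of-Mark-style) prompt: the model answers with an
    element index, which the agent resolves back to screen coordinates."""
    listing = "\n".join(f"[{i}] {e['type']}: {e['label']}"
                        for i, e in enumerate(elements))
    return (f"Task: {task}\n"
            f"Interactive elements on this screen:\n{listing}\n"
            'Respond with JSON {"action": "tap", "element": <index>}.')

def ask(screenshot_path: str, prompt: str) -> str:
    """Send the prompt together with the screenshot to the model."""
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```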
| Method | Date | Task Type | Backbone | Size | Contributions |
|---|---|---|---|---|---|
| SeeClick Cheng et al. (2024) | 2024.01 | General | Qwen-VL | 9.6B | GUI grounding pre-training; ScreenSpot benchmark |
| ReALM Moniz et al. (2024) | 2024.04 | Reference Resolution | FLAN-T5 | 80M–3B | Formulated reference resolution as language modeling; Improved performance on resolving references |
| GUICourse Chen et al. (2024b) | 2024.06 | General | Qwen-VL, Fuyu-8B, MiniCPM-V | N/A | Suite of datasets (GUIEnv, GUIAct, GUIChat); Enhanced OCR and grounding |
| GUI Odyssey Lu et al. (2024a) | 2024.06 | General | Qwen-VL | N/A | Cross-app navigation dataset; OdysseyAgent with history resampling |
| IconDesc Haque and Csallner (2024) | 2024.09 | Alt-Text Generation | GPT-3.5 | N/A | Generated alt-text for UI icons using partial UI data; Improved accessibility |
| TinyClick Pawlowski et al. (2024) | 2024.10 | General | Florence-2 | 0.27B | Single-turn agent; Multitask training; MLLM-based data augmentation |
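Whatever the backbone, supervised fine-tuning reduces to serializing demonstration steps into (prompt, target-action) pairs. The sketch below shows one plausible serialization; the field layout is illustrative, not the exact format used by any method above.

```python
import json

def to_sft_example(goal: str, history: list[str], screen_repr: str, action: dict) -> dict:
    """Serialize one step of a demonstration into a (prompt, target) pair."""
    prompt = (f"Goal: {goal}\n"
              f"Previous actions: {'; '.join(history) or 'none'}\n"
              f"Screen: {screen_repr}\n"
              "Next action:")
    return {"prompt": prompt, "target": json.dumps(action)}

example = to_sft_example(
    goal="Turn on airplane mode",
    history=["open Settings"],
    screen_repr="[0] switch: Airplane mode (off) [1] button: Wi-Fi",
    action={"action": "tap", "element": 0},
)
print(example["prompt"], "->", example["target"])
```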
The datasets below are compared by action-space size (Actions), number of demonstrations (Demos), app coverage (Apps), instruction count (Instr.), and average steps per task; they further vary in whether raw screenshots and UI trees accompany the traces.

| Dataset | Date | Actions | Demos | Apps | Instr. | Avg. Steps | Contributions |
|---|---|---|---|---|---|---|---|
| PixelHelp Li et al. (2020) | 2020.05 | 4 | 187 | 4 | 187 | 4.2 | Grounding instructions to actions |
| MoTIF Burns et al. (2021) | 2021.04 | 6 | 4,707 | 125 | 276 | 4.5 | Interactive visual environment |
| UIBert Bai et al. (2021) | 2021.07 | N/A | N/A | N/A | 16,660 | 1 | Pre-training task |
| Meta-GUI Sun et al. (2022) | 2022.05 | 7 | 4,684 | 11 | 1,125 | 5.3 | Multi-turn dialogues |
| UGIF Venkatesh et al. (2022) | 2022.11 | 8 | 523 | 12 | 523 | 5.3 | Multilingual UI-grounded instructions |
| AITW Rawles et al. (2024b) | 2023.12 | 7 | 715,142 | 357 | 30,378 | 6.5 | Large-scale interactions |
| AITZ Zhang et al. (2024c) | 2024.03 | 7 | 18,643 | 70 | 2,504 | 7.5 | Chain-of-Action-Thought annotations |
| GUI Odyssey Lu et al. (2024a) | 2024.06 | 9 | 7,735 | 201 | 7,735 | 15.4 | Cross-app navigation |
| AndroidControl Li et al. (2024a) | 2024.07 | 8 | 15,283 | 833 | 15,283 | 4.8 | UI task scaling law |
| AMEX Chai et al. (2024) | 2024.07 | 8 | 2,946 | 110 | 2,946 | 12.8 | Multi-level detailed annotations |
The benchmarks below vary in which dimensions they assess: task completion, action quality, resource efficiency, task understanding, format compliance, completion awareness, reward, and evaluation accuracy.

| Benchmark | Date | Tasks |
|---|---|---|
| AutoDroid Wen et al. (2024) | 2023.09 | N/A |
| MobileEnv Zhang et al. (2023b) | 2023.05 | 74 |
| AndroidArena Xing et al. (2024) | 2024.02 | N/A |
| LlamaTouch Zhang et al. (2024e) | 2024.04 | 496 |
| B-MoCA Lee et al. (2024) | 2024.04 | 131 |
| AndroidWorld Rawles et al. (2024a) | 2024.05 | 116 |
| MobileAgentBench Wang et al. (2024d) | 2024.06 | 100 |
| AUITestAgent Hu et al. (2024) | 2024.07 | N/A |
| AndroidLab Xu et al. (2024) | 2024.10 | 138 |
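Two metrics recur across these benchmarks: step-wise action accuracy (does each predicted action match the reference?) and episode-level task success rate. A minimal sketch of both, assuming actions are represented as comparable dictionaries:

```python
def step_accuracy(pred_actions: list[dict], ref_actions: list[dict]) -> float:
    """Fraction of steps where the predicted action matches the reference."""
    matches = sum(p == r for p, r in zip(pred_actions, ref_actions))
    return matches / max(len(ref_actions), 1)

def task_success_rate(episodes: list[tuple]) -> float:
    """Fraction of episodes whose final state check passed.
    Each episode is (pred_actions, ref_actions, success_flag)."""
    return sum(flag for _, _, flag in episodes) / max(len(episodes), 1)

preds = [{"type": "tap", "element": 0}, {"type": "type", "text": "milk"}]
refs  = [{"type": "tap", "element": 0}, {"type": "type", "text": "oat milk"}]
print(step_accuracy(preds, refs))  # 0.5
```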