Key Idea
Developing autonomous agents for mobile device control can significantly enhance user interactions, but the lack of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. This work introduces B-MoCA, a novel benchmark designed specifically for evaluating mobile device control agents across diverse Android configurations.
Abstract
The authors introduce B-MoCA, a benchmark designed to evaluate the performance of mobile device control agents on diverse Android device configurations. The key features of B-MoCA include:
- It is based on the Android operating system, providing a realistic environment for evaluating agents.
- It defines 60 everyday tasks built around widely used applications such as Chrome and Calendar.
- It incorporates a randomization feature that changes various aspects of mobile devices, including user interface layouts, wallpapers, languages, and device types, to assess the generalization performance of agents.
- It provides rule-based success detectors to reliably evaluate the agents' performance in completing the tasks.
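The randomization and rule-based success detection described in the list above can be illustrated with a minimal sketch. Every name here (`DeviceConfig`, `sample_config`, `alarm_task_succeeded`, the option lists, and the shape of the `device_state` dictionary) is a hypothetical stand-in chosen for illustration and is not taken from the B-MoCA codebase; the actual benchmark may organize these pieces quite differently.

```python
import random
from dataclasses import dataclass

# All names below are hypothetical and for illustration only;
# they are not taken from the B-MoCA codebase.

@dataclass(frozen=True)
class DeviceConfig:
    device_type: str
    language: str
    wallpaper: str
    icon_layout_seed: int  # assumed: a seed that determines icon placement

# Assumed option pools; the real benchmark's choices may differ.
DEVICE_TYPES = ["phone_small", "phone_large", "tablet"]
LANGUAGES = ["en", "ko", "fr"]
WALLPAPERS = ["default", "dark", "photo"]

def sample_config(rng: random.Random) -> DeviceConfig:
    """Draw one randomized device configuration from the option pools."""
    return DeviceConfig(
        device_type=rng.choice(DEVICE_TYPES),
        language=rng.choice(LANGUAGES),
        wallpaper=rng.choice(WALLPAPERS),
        icon_layout_seed=rng.randrange(1_000_000),
    )

def alarm_task_succeeded(device_state: dict, hour: int, minute: int) -> bool:
    """Rule-based detector: succeed if an enabled alarm exists at the target time.

    `device_state` is an assumed dictionary snapshot of the device,
    e.g. {"alarms": [{"hour": 7, "minute": 30, "enabled": True}]}.
    """
    return any(
        alarm.get("enabled")
        and alarm.get("hour") == hour
        and alarm.get("minute") == minute
        for alarm in device_state.get("alarms", [])
    )
```

Passing a seeded `random.Random` instance keeps the sampled configurations reproducible, which matters when comparing agents on the same set of environments. The detector checks the resulting device state rather than the agent's action sequence, so any path that reaches the goal counts as a success.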
The authors benchmark three types of agents: LLM agents, MLLM agents, and Vision-Language-UI (VLUI) agents. LLM and MLLM agents are built on pre-trained large language models and multi-modal LLMs, respectively, while VLUI agents are trained from scratch on human expert demonstrations.
The experiments reveal that the agents exhibit fundamental skills in mobile device control, such as solving straightforward tasks or completing tasks in training environments. However, they struggle with harder tasks and with generalizing to unseen device configurations. The authors analyze the strengths and limitations of each agent type and discuss the effects of different design choices, such as the use of pre-trained encoders and training data diversity for VLUI agents.
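One simple way to quantify the gap between training environments and unseen device configurations is to compare per-episode success rates. The sketch below assumes each episode yields a boolean outcome from a success detector; the function names `success_rate` and `generalization_gap` are illustrative and are not claimed to be the metric definitions used in the paper.

```python
from typing import Iterable

def success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of episodes in which the success detector fired."""
    results = list(outcomes)
    if not results:
        raise ValueError("at least one episode outcome is required")
    return sum(results) / len(results)

def generalization_gap(
    train_outcomes: Iterable[bool], unseen_outcomes: Iterable[bool]
) -> float:
    """Success rate on training configurations minus that on unseen ones.

    A larger positive value means the agent transfers less well.
    """
    return success_rate(train_outcomes) - success_rate(unseen_outcomes)
```

For instance, an agent that succeeds in every training episode but in only one of four episodes on unseen configurations would have a gap of 0.75 under this definition. These numbers are placeholders, not results from the paper.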
The authors open-source the code and related materials for B-MoCA, aiming to help future researchers identify challenges in building assistive agents and compare their methods against prior work.
Statistics
"Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility."
"To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks."
"We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs (MLLMs) as well as agents trained from scratch using human expert demonstrations."
Quotes
"While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness."
"Our source code is publicly available at https://b-moca.github.io."