Controlling Smartphones with Vision-Language Models: Enabling Intuitive Device Interaction through Screen-Based Commands
A novel vision-language model that executes diverse user instructions on mobile devices by interacting solely with the user interface, using both visual and textual inputs.
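
As a rough illustration of what interacting solely through the user interface could look like, the sketch below runs an observe-and-act loop: capture a screenshot, pass it to the model together with the textual instruction, and execute the UI action the model returns. Every name here (`Action`, `run_episode`, `scripted_policy`, the action kinds) is hypothetical and not taken from the model described above. `scripted_policy` is only a stand-in for the vision-language model so that the example runs without one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """A hypothetical UI action; the real model's action space may differ."""
    kind: str                               # "tap", "type", or "done"
    point: Optional[Tuple[int, int]] = None  # screen coordinates for "tap"
    text: Optional[str] = None               # characters to enter for "type"


# A policy maps (screenshot bytes, instruction, past actions) to the next action.
Policy = Callable[[bytes, str, List[Action]], Action]


def run_episode(
    policy: Policy,
    instruction: str,
    capture_screen: Callable[[], bytes],
    perform: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe the screen, ask the policy for an action, perform it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                      # visual input
        action = policy(screenshot, instruction, history)  # plus textual input
        history.append(action)
        if action.kind == "done":
            break
        perform(action)  # the only way the agent touches the device
    return history


def scripted_policy(screenshot: bytes, instruction: str, history: List[Action]) -> Action:
    """Placeholder for a vision-language model: replays a fixed script."""
    script = [
        Action("tap", point=(540, 120)),  # assumed location of a search box
        Action("type", text=instruction),
        Action("done"),
    ]
    return script[min(len(history), len(script) - 1)]


if __name__ == "__main__":
    performed: List[Action] = []
    trace = run_episode(scripted_policy, "weather today", lambda: b"", performed.append)
    for step in trace:
        print(step)
```

In a real setup, `capture_screen` and `perform` would be backed by whatever device-control layer is available, and the policy would be the model itself. The loop structure is the only part this sketch is meant to convey.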