The paper introduces a new dataset and framework for computer user interface (UI) understanding. It addresses a gap in prior work, which has largely overlooked full desktop UIs despite their greater complexity and variability compared with web and mobile applications. The dataset consists of videos capturing user actions on a computer screen, with the goal of automating workflow processes by inferring the state of the interface from screen images. The proposed framework, UI Multi-task Contrastive Learning (UIMTCon), combines synthetic sample generation with contrastive learning to classify the frames of these videos accurately. Experimental results show improved performance over baseline methods on fine-grained UI classification.
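To make the idea of pairing synthetic sample generation with contrastive learning concrete, here is a minimal sketch of such a setup: each UI screenshot is paired with a synthetically perturbed view of itself, and an InfoNCE-style loss pulls the two views together while pushing apart other screenshots in the batch. All names and parameters here (`UIEncoder`, `synthesize_view`, the temperature of 0.07) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: contrastive learning over UI screenshots with synthetic views.
# This is NOT the UIMTCon implementation; it only illustrates the general recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UIEncoder(nn.Module):
    """Tiny CNN encoder producing L2-normalized embeddings of UI screenshots."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.backbone(x)), dim=-1)


def synthesize_view(screens: torch.Tensor) -> torch.Tensor:
    """Stand-in for synthetic sample generation: brightness jitter plus noise."""
    jitter = 1.0 + 0.1 * torch.randn(screens.size(0), 1, 1, 1)
    return (screens * jitter + 0.02 * torch.randn_like(screens)).clamp(0, 1)


def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Each screenshot's synthetic view is its positive; the rest of the
    batch serves as negatives."""
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    encoder = UIEncoder()
    screens = torch.rand(8, 3, 128, 128)    # a batch of placeholder screenshots
    loss = info_nce_loss(encoder(screens), encoder(synthesize_view(screens)))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

In practice, the synthetic views would come from a UI-aware generator rather than pixel jitter, and the learned embeddings would feed a downstream classifier over the fine-grained UI states described in the paper.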
Source: https://arxiv.org/pdf/2403.10170.pdf (arxiv.org, 2024-03-18)