The paper introduces a new dataset and framework for computer user interface (UI) understanding. It addresses a gap in prior work, which has focused on web and mobile applications rather than complete computer UIs, whose layouts are more complex and variable. The dataset consists of videos capturing user actions on a computer screen, with the goal of automating workflow processes by inferring the state of a computation from screen images. The proposed framework, UI Multi-task Contrastive Learning (UIMTCon), combines synthetic sample generation with contrastive learning to accurately classify the frames of these videos. Experimental results show improved performance over baseline methods on fine-grained UI classification.
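The paper's own implementation is not reproduced here, but the core idea of pairing each real UI frame with a synthetically generated variant and training them as contrastive positives can be illustrated with a standard NT-Xent (InfoNCE-style) objective. The sketch below is a minimal, hedged example of that general technique; the function names, embedding dimensions, and the use of PyTorch are assumptions for illustration, not UIMTCon's actual API.

```python
# Minimal sketch: contrastive training on (real frame, synthetic variant) pairs.
# This illustrates the generic NT-Xent loss, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def nt_xent_loss(z_real: torch.Tensor, z_synth: torch.Tensor,
                 temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss over a batch of (real, synthetic) embedding pairs.

    z_real, z_synth: (batch, dim) L2-normalized embeddings, where row i of
    each tensor encodes the same underlying UI state (a positive pair).
    """
    z = torch.cat([z_real, z_synth], dim=0)       # (2B, dim)
    sim = z @ z.t() / temperature                 # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))             # exclude trivial self-pairs
    batch = z_real.size(0)
    # The positive for row i is row i + batch, and vice versa.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets.to(sim.device))

# Toy usage: random embeddings stand in for encoder outputs; the "synthetic"
# view is simulated by adding small noise to the real embedding.
if __name__ == "__main__":
    torch.manual_seed(0)
    z_real = F.normalize(torch.randn(8, 128), dim=1)
    z_synth = F.normalize(z_real + 0.1 * torch.randn(8, 128), dim=1)
    print(nt_xent_loss(z_real, z_synth).item())
```

In this setup the synthetic samples play the role that random augmentations play in standard contrastive learning: they give the encoder cheap positive pairs so it can learn fine-grained distinctions between visually similar UI states without extra manual labels.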