The paper introduces a new dataset and framework for computer user interface (UI) understanding. It addresses the lack of attention to complete computer UIs in prior work, emphasizing that computer interfaces are more complex and variable than web and mobile applications. The dataset consists of videos capturing user actions on a computer screen, with the goal of automating workflow processes by inferring the state of the computation from screen images. The proposed framework, UI Multi-task Contrastive Learning (UIMTCon), combines synthetic sample generation with contrastive learning to accurately classify video frames. Experimental results show improved performance over baseline methods on fine-grained UI classification.
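The paper does not spell out the exact loss used by UIMTCon; as a hedged illustration of the contrastive-learning component, the sketch below implements a generic InfoNCE-style loss, where each anchor frame embedding is pulled toward a matching positive embedding (e.g., a synthetic variant of the same UI state) and pushed away from all other frames in the batch. The function name, shapes, and temperature value are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE contrastive loss (illustrative, not the paper's exact loss).

    anchors, positives: (N, D) embedding matrices; row i of `positives`
    is the positive pair for row i of `anchors`, and all other rows in
    the batch serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries correspond to the anchor-positive pairs.
    return -np.mean(np.diag(log_probs))
```

In this setup, synthetic sample generation would supply the positive pairs: a rendered or augmented copy of a UI screenshot shares a label with the original, so their embeddings are trained to agree while differing UI states are pushed apart.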