Core Concepts
The paper introduces UIClip, a vision-language model that outperforms existing text-only and generic CLIP-based approaches at retrieving relevant mobile GUI screenshots from a large repository using textual queries.
Abstract
The paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which is trained on a large dataset of mobile app screenshots and captions.
Key highlights:
- The authors created two large datasets - GPSRepo (303k screenshots) and GPSCap (135k screenshot-caption pairs) - by mining app introduction images from Google Play.
- They fine-tuned the CLIP model on the GPSCap dataset to create UIClip, a vision-language model tailored for the GUI domain.
- Evaluation shows that UIClip outperforms text-only and generic CLIP-based approaches on text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91 (a retrieval sketch follows this list).
- The GUing search engine, built on top of UIClip and the GUI repository, delivers highly relevant search results compared to baseline approaches.
- The authors also explore the performance of UIClip on other GUI-related tasks like sketch-to-GUI retrieval and GUI classification.
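To make the text-to-GUI retrieval step concrete, here is a minimal sketch of how a CLIP-style model can rank screenshots against a textual query. It uses the public `openai/clip-vit-base-patch32` checkpoint from Hugging Face rather than the paper's UIClip weights, and the screenshot paths and query string are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint used as a stand-in for UIClip.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Encode screenshots into L2-normalized image embeddings."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(query):
    """Encode a textual query into an L2-normalized text embedding."""
    inputs = processor(text=[query], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Placeholder repository and query; rank screenshots by cosine similarity.
screenshot_paths = ["shot_001.png", "shot_002.png"]
gallery = embed_images(screenshot_paths)
query_vec = embed_text("login screen with social sign-in buttons")
scores = (query_vec @ gallery.T).squeeze(0)
top_k = scores.topk(k=min(10, len(screenshot_paths))).indices.tolist()
print([screenshot_paths[i] for i in top_k])
```

Because both modalities are mapped into the same embedding space, retrieval reduces to a nearest-neighbour search over image embeddings that can be precomputed for the whole repository.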
Statistics
"The GPSRepo dataset comprises 303k screenshots, out of which 135k have captions."
"UIClip achieves a Recall@10 of up to 0.69 and a HIT@10 of 0.91 in text-to-GUI retrieval tasks."
"GUing outperforms the baseline RaWi search engine, with a P@10 of 0.343 compared to 0.214, and a HIT@10 of 0.914 compared to 0.701."
Quotes
"Recent Vision-Language Models (VLMs), such as CLIP [49], BLIP [34], and BLIP-2 [33], are trained in large-scale image-caption data by contrastive learning. These models have the ability to transform images and text into a multimodal embedding, ensuring that semantically similar images and texts are mapped closely in the embedding space."
"By leveraging the GPSCap, in conjunction with Screen2Words and Clarity datasets, we fine tuned the CLIP model to create a VLM specific to the GUI domain. We call the new model UIClip."