The paper introduces UIClip, a computational model that assesses the design quality of a user interface (UI) screenshot and its visual relevance to a natural language description.
To train UIClip, the authors developed a large-scale dataset called JitterWeb, which contains over 2.3 million UI screenshots paired with synthetic descriptions that include design quality tags and identified design defects. They also collected a smaller human-rated dataset called BetterApp, where professional designers provided relative rankings and design feedback on UI screenshots.
UIClip is built upon the CLIP vision-language model, but the authors found that off-the-shelf CLIP models perform poorly on UI design assessment tasks. To address this, they fine-tuned CLIP on the JitterWeb and BetterApp datasets, incorporating a pairwise contrastive objective so the model better distinguishes good from bad UI designs.
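To make the pairwise idea concrete, below is a minimal sketch of a margin-style preference loss over (description, better screenshot, worse screenshot) triples, written against the Hugging Face `transformers` CLIP API. The function name, the margin value, and the base checkpoint are illustrative assumptions, not the authors' actual training code or released weights.

```python
# Sketch of a pairwise preference loss for fine-tuning a CLIP-style model.
# Assumes triples of (description, better UI image, worse UI image);
# names and hyperparameters here are hypothetical.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pairwise_quality_loss(texts, better_images, worse_images, margin=0.1):
    """Encourage the better-designed screenshot to score higher against
    its description than the worse-designed one."""
    text_in = processor(text=texts, return_tensors="pt",
                        padding=True, truncation=True)
    good_in = processor(images=better_images, return_tensors="pt")
    bad_in = processor(images=worse_images, return_tensors="pt")

    t = F.normalize(model.get_text_features(**text_in), dim=-1)
    g = F.normalize(model.get_image_features(**good_in), dim=-1)
    b = F.normalize(model.get_image_features(**bad_in), dim=-1)

    score_good = (t * g).sum(dim=-1)  # cosine similarity, shape [batch]
    score_bad = (t * b).sum(dim=-1)

    # Hinge on the score gap; a standard CLIP contrastive term on
    # (text, better image) pairs could be added alongside this loss.
    return F.relu(margin - (score_good - score_bad)).mean()
```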
Evaluation results show that UIClip outperforms several large vision-language model baselines on three key tasks: 1) identifying the better UI design from a pair, 2) generating relevant design suggestions based on detected flaws, and 3) retrieving UI examples that match a given natural language description. The authors also present three example applications that demonstrate how UIClip can facilitate downstream UI design tasks: quality-aware UI code generation, design recommendation, and design example retrieval.
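For intuition on how such a model plugs into these applications, here is a minimal inference sketch, assuming a CLIP-style checkpoint loaded through `transformers`. The checkpoint name is a stand-in (not the released UIClip weights), and the file paths and description string are hypothetical.

```python
# Sketch: score two candidate UI screenshots against a description and
# pick the higher-scoring one; the same scores support retrieval by
# ranking a larger pool of screenshots.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # placeholder for a UIClip-style checkpoint
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def score(description: str, images: list) -> torch.Tensor:
    """Cosine similarity between one description and each screenshot."""
    text_in = processor(text=[description], return_tensors="pt", padding=True)
    img_in = processor(images=images, return_tensors="pt")
    t = F.normalize(model.get_text_features(**text_in), dim=-1)  # [1, d]
    v = F.normalize(model.get_image_features(**img_in), dim=-1)  # [n, d]
    return (v @ t.T).squeeze(-1)                                 # [n]

# Hypothetical usage: higher score = predicted better / more relevant design.
shots = [Image.open("candidate_a.png"), Image.open("candidate_b.png")]
scores = score("a checkout screen for a food delivery app", shots)
print("better design:", "A" if scores[0] > scores[1] else "B")
```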