The VISION2UI dataset was constructed by the authors to address the limitations of existing datasets for training MLLMs on the task of generating UI code from design images. The key highlights of the dataset are:
Data Collection: The data was extracted from Common Crawl, a vast archive of real-world web pages. The authors downloaded each page's CSS files and image assets, then cleaned the HTML by removing redundant elements and applying length filters (sketched below).
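The paper does not specify the exact cleaning tooling, so the following is only a minimal illustrative sketch, assuming BeautifulSoup is used to strip non-visual elements and that the length-filter bounds are placeholders:

```python
# Hypothetical sketch of the HTML cleaning step: strip elements that do not
# affect rendering and drop pages whose cleaned HTML falls outside a length window.
from bs4 import BeautifulSoup, Comment

MIN_LEN, MAX_LEN = 1_000, 100_000  # assumed length-filter bounds (not from the paper)

def clean_html(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")

    # Remove tags that carry no visual content.
    for tag in soup(["script", "noscript", "iframe"]):
        tag.decompose()

    # Remove HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    cleaned = str(soup)
    # Length filter: discard pages that are trivially short or excessively long.
    if not (MIN_LEN <= len(cleaned) <= MAX_LEN):
        return None
    return cleaned
```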
Screenshots Generation: The authors used Pyppeteer to generate screenshots of the cleaned web pages and simultaneously captured the layout information (size and position) of each HTML element.
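A minimal sketch of this rendering step with Pyppeteer is shown below; the file paths, viewport size, and the decision to record every element's bounding box via getBoundingClientRect are illustrative assumptions, not details taken from the paper:

```python
# Render a cleaned page with Pyppeteer, save a screenshot, and record the
# size and position of every DOM element.
import asyncio
from pyppeteer import launch

async def render(html_path, shot_path):
    browser = await launch()
    page = await browser.newPage()
    await page.setViewport({"width": 1280, "height": 800})  # assumed viewport
    await page.goto(f"file://{html_path}")
    await page.screenshot({"path": shot_path, "fullPage": True})

    # Collect layout information (size and position) for each element.
    layout = await page.evaluate("""() =>
        Array.from(document.querySelectorAll('*')).map(el => {
            const r = el.getBoundingClientRect();
            return {tag: el.tagName, x: r.x, y: r.y,
                    width: r.width, height: r.height};
        })
    """)
    await browser.close()
    return layout

layout = asyncio.run(render("/tmp/page.html", "/tmp/page.png"))
```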
Filtering with Neural Scorer: To further improve the quality of the dataset, the authors trained a neural network model as a scorer to assess the aesthetics and completeness of the screenshots. Data points with low scores were filtered out.
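The scorer's architecture is not described in this summary, so the sketch below simply assumes a CNN backbone with a single regression head and a placeholder threshold; the weight file and cut-off value are hypothetical:

```python
# Illustrative scorer-based filtering: keep a screenshot only if its predicted
# quality score exceeds a threshold. Architecture and threshold are assumptions.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

THRESHOLD = 0.5  # assumed cut-off for keeping a sample

scorer = models.resnet18(weights=None)
scorer.fc = nn.Linear(scorer.fc.in_features, 1)   # single quality score
scorer.load_state_dict(torch.load("scorer.pt"))   # hypothetical trained weights
scorer.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def keep(screenshot_path):
    """Return True if the screenshot's predicted quality exceeds the threshold."""
    img = preprocess(Image.open(screenshot_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        score = torch.sigmoid(scorer(img)).item()
    return score >= THRESHOLD
```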
Dataset Statistics: The final VISION2UI dataset currently consists of 20,000 parallel samples of HTML code and UI design images, together with the layout information; the authors state that substantially more data will be released. Compared to existing datasets, VISION2UI exhibits significantly greater diversity in HTML structure, element types, layout, and coloration.
The authors discuss the practical challenges of automatic HTML code generation from design images, such as the complexity of CSS and the difficulty in accurately capturing the HTML DOM tree structure. They plan to further enhance the dataset and explore effective model architectures to address these challenges.