Key Concepts
Vision-language models fine-tuned on the WebSight dataset can efficiently convert webpage screenshots into HTML code.
Abstract
Vision-language models (VLMs) could transform web development by converting screenshots directly into HTML code. The WebSight dataset, consisting of 2 million pairs of HTML code and corresponding screenshots, enables fine-tuning VLMs for this task. The dataset is built through synthetic data generation with advanced language models: Mistral-7B-Instruct produces diverse website concepts, and Deepseek-Coder-33b-instruct turns them into final HTML. Because Tailwind CSS keeps styling inline in the HTML document rather than in separate stylesheets, it simplifies the training targets and improves model training. Despite successes such as Sightseer's proficiency in converting sketches to HTML, challenges remain with complex layouts and image integration. Fine-tuning on WebSight accelerates UI development and fosters innovation in AI-powered tools.
Statistics
WebSight dataset consists of 2 million pairs of HTML codes and screenshots.
Mistral-7B-Instruct generates diverse website concepts for dataset enrichment.
Deepseek-Coder-33b-instruct is used to generate final HTML codes from creative outputs.
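The two-stage generation described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's actual code: `generate_concept` and `generate_html` are hypothetical stubs standing in for calls to Mistral-7B-Instruct and Deepseek-Coder-33b-instruct, and the rendering of HTML into a paired screenshot (assumed here to be done by a headless browser) is left as a comment.

```python
# Hypothetical sketch of WebSight's two-stage synthetic-data pipeline.
# The stub functions below are placeholders for the real LLM calls.
import random


def generate_concept(seed: int) -> str:
    """Stand-in for Mistral-7B-Instruct: produce a diverse website concept."""
    topics = ["bakery landing page", "travel blog", "developer portfolio"]
    return random.Random(seed).choice(topics)


def generate_html(concept: str) -> str:
    """Stand-in for Deepseek-Coder-33b-instruct: emit Tailwind-styled HTML.

    Styling lives inline as utility classes, so the model has to produce
    only a single self-contained document -- no separate stylesheet.
    """
    return (
        "<html><body>"
        f'<h1 class="text-3xl font-bold text-center">{concept}</h1>'
        "</body></html>"
    )


def make_pair(seed: int) -> tuple[str, str]:
    """One dataset entry: (concept, HTML code).

    In the real pipeline, a headless browser would then render the HTML
    to a screenshot, completing the (screenshot, HTML) training pair.
    """
    concept = generate_concept(seed)
    return concept, generate_html(concept)


if __name__ == "__main__":
    concept, html = make_pair(0)
    print(concept)
    print(html)
```

Keeping all styling as Tailwind utility classes in the `class` attribute is what lets each example be a single HTML file, which is what makes the pairs convenient fine-tuning targets.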
Quotes
"By leveraging synthetic data generation and fine-tuning a high-capacity base VLM on the dataset, we demonstrate a viable path to accelerate UI development tasks."
"Sightseer exhibits the capability to generalize beyond its training dataset to websites that differ significantly in appearance."
"We open-source WebSight to foster further innovation and research in automating webpage screenshot conversion."