Key Concepts
This paper introduces a rich human feedback dataset (RichHF-18K) and a multimodal transformer model (RAHF) to provide detailed and interpretable evaluations of text-to-image generation models. The dataset contains fine-grained scores, implausibility/misalignment image regions, and misaligned keywords, which can be used to train the RAHF model to automatically predict such rich feedback on generated images.
Summary
The paper addresses the limitations of existing text-to-image evaluation metrics, which often summarize image quality into a single numeric score and do not provide detailed insights. The authors collected the RichHF-18K dataset, which contains rich human feedback on 18K generated images, including:
- Point annotations on the image to highlight regions of implausibility/artifacts and text-image misalignment.
- Labeled words on the prompts specifying the missing or misrepresented concepts in the generated image.
- Four types of fine-grained scores for image plausibility, text-image alignment, aesthetics, and overall rating.
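The three kinds of annotation listed above can be pictured as one record per generated image. The following is a minimal sketch only: the class name `RichFeedback`, its field names, the (x, y) pixel convention, and the score values are all illustrative assumptions, not the actual RichHF-18K schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x, y) pixel coordinates; this coordinate convention is an assumption.
Point = Tuple[int, int]


@dataclass
class RichFeedback:
    """Hypothetical container for the feedback on one generated image."""

    prompt: str
    # Points marking implausible regions or artifacts in the image.
    artifact_points: List[Point] = field(default_factory=list)
    # Points marking regions that do not match the prompt.
    misalignment_points: List[Point] = field(default_factory=list)
    # Prompt words that are missing or misrepresented in the image.
    misaligned_words: List[str] = field(default_factory=list)
    # Four fine-grained scores; the numeric scale here is assumed.
    plausibility: float = 0.0
    alignment: float = 0.0
    aesthetics: float = 0.0
    overall: float = 0.0

    def has_issues(self) -> bool:
        """True if any region or word was flagged by the annotator."""
        return bool(
            self.artifact_points
            or self.misalignment_points
            or self.misaligned_words
        )


# Made-up example values, for illustration only.
example = RichFeedback(
    prompt="a lamp on a wooden table",
    artifact_points=[(120, 45)],
    misaligned_words=["table"],
    plausibility=0.4,
    alignment=0.5,
    aesthetics=0.7,
    overall=0.5,
)
clean = RichFeedback(prompt="a red apple")
```

Keeping the point annotations, flagged words, and scores in a single record makes it easy to see that each image carries both localized feedback and summary ratings.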
The authors then designed a multimodal transformer model called RAHF to automatically predict this rich human feedback on generated images. RAHF outperforms baseline models in predicting the fine-grained scores, implausibility/misalignment heatmaps, and misaligned keywords.
The authors further demonstrate that the rich feedback predicted by RAHF can be used to improve image generation. Using the predicted heatmaps as masks to inpaint problematic image regions, and using the predicted scores to help finetune image generation models such as Muse, both lead to better images than the original models produce. Notably, these improvements generalize to a model (Muse) beyond those used to generate the images on which the human feedback was collected (Stable Diffusion variants).
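The two uses of predicted feedback described here can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the function names `heatmap_to_mask` and `select_for_finetuning`, the 0.5 threshold, the score cutoff, and the idea of using scores as a simple data filter are all hypothetical choices made for the example.

```python
from typing import List


def heatmap_to_mask(
    heatmap: List[List[float]], threshold: float = 0.5
) -> List[List[bool]]:
    """Mark pixels whose predicted heatmap value exceeds the threshold.

    The resulting boolean grid could be handed to an inpainting model as
    the region to regenerate. The threshold value is an assumption.
    """
    return [[value > threshold for value in row] for row in heatmap]


def select_for_finetuning(scores: List[float], min_score: float) -> List[int]:
    """Return indices of images whose predicted score is at least min_score.

    Filtering by score is one assumed way predicted scores might help
    choose finetuning examples.
    """
    return [i for i, score in enumerate(scores) if score >= min_score]


# Toy 2x2 heatmap and toy scores, made up for illustration.
heatmap = [[0.1, 0.8], [0.6, 0.2]]
mask = heatmap_to_mask(heatmap)
kept = select_for_finetuning([0.3, 0.9, 0.7], min_score=0.7)
```

Here `mask` flags the two high-valued cells of the toy heatmap, and `kept` holds the indices of the two images that meet the cutoff.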
Statistics
"Only ~10% of the generated images in the Pick-a-Pic dataset are free of artifacts and implausibility."
"In the Pick-a-Pic dataset, many images contain distorted human/animal bodies, distorted objects and implausibility issues such as a floating lamp."
Quotes
"Existing automatic evaluation metrics for generated images, however, including the well-known IS and FID, are computed over distributions of images and may not reflect nuances in individual images."
"Recent research has collected human preferences/ratings to evaluate the quality of generated images and trained evaluation models to predict those ratings, notably ImageReward or Pick-a-Pic. While more focused, these metrics still summarize the quality of one image into a single numeric score."