Rich Human Feedback Dataset and Model for Evaluating Text-to-Image Generation
This paper introduces RichHF-18K, a rich human feedback dataset, and RAHF, a multimodal transformer model, for detailed and interpretable evaluation of text-to-image generation. Each example in the dataset is annotated with fine-grained scores, image regions marked as implausible or misaligned with the prompt, and the misaligned prompt keywords; these annotations are used to train RAHF to automatically predict such rich feedback on generated images.
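To make the annotation types concrete, the sketch below shows one hypothetical record combining the feedback described above. All field names and the 0-1 score scale are illustrative assumptions, not the actual RichHF-18K schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RichFeedbackExample:
    """Hypothetical record illustrating the annotation types described
    for RichHF-18K; field names and score scale are assumptions."""
    prompt: str
    image_path: str
    # Fine-grained scores, assumed normalized to [0, 1]
    plausibility_score: float
    alignment_score: float
    overall_score: float
    # Paths to masks marking implausible / prompt-misaligned regions
    implausibility_mask_path: str
    misalignment_mask_path: str
    # Prompt words the generated image fails to depict correctly
    misaligned_keywords: List[str] = field(default_factory=list)

example = RichFeedbackExample(
    prompt="a red cat riding a bicycle",
    image_path="images/0001.png",
    plausibility_score=0.4,
    alignment_score=0.6,
    overall_score=0.5,
    implausibility_mask_path="masks/0001_implausible.png",
    misalignment_mask_path="masks/0001_misaligned.png",
    misaligned_keywords=["bicycle"],
)
print(example.misaligned_keywords)  # → ['bicycle']
```

A model like RAHF would be trained to predict each of these fields (scores, region masks, and keyword flags) from the prompt-image pair alone.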