Rich Human Feedback Dataset and Model for Evaluating Text-to-Image Generation
This paper introduces RichHF-18K, a rich human feedback dataset, and RAHF, a multimodal transformer model, for detailed and interpretable evaluation of text-to-image generation. Each example in the dataset is annotated with fine-grained scores, image regions marked as implausible or misaligned with the prompt, and the misaligned prompt keywords; these annotations are used to train RAHF to automatically predict such rich feedback on generated images.
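To make the annotation types concrete, the sketch below shows one hypothetical record combining the feedback described above. All field names and the 0-1 score scale are illustrative assumptions, not the actual RichHF-18K schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RichFeedbackExample:
    """Hypothetical record illustrating the annotation types described
    for RichHF-18K; field names and score scale are assumptions."""
    prompt: str
    image_path: str
    # Fine-grained scores, assumed normalized to [0, 1]
    plausibility_score: float
    alignment_score: float
    overall_score: float
    # Paths to masks marking implausible / prompt-misaligned regions
    implausibility_mask_path: str
    misalignment_mask_path: str
    # Prompt words the generated image fails to depict correctly
    misaligned_keywords: List[str] = field(default_factory=list)

example = RichFeedbackExample(
    prompt="a red cat riding a bicycle",
    image_path="images/0001.png",
    plausibility_score=0.4,
    alignment_score=0.6,
    overall_score=0.5,
    implausibility_mask_path="masks/0001_implausible.png",
    misalignment_mask_path="masks/0001_misaligned.png",
    misaligned_keywords=["bicycle"],
)
print(example.misaligned_keywords)  # → ['bicycle']
```

A model like RAHF would be trained to predict each of these fields (scores, region masks, and keyword flags) from the prompt-image pair alone.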