The authors propose Polos, a supervised automatic evaluation metric for image captioning models that is trained on human feedback and uses a parallel feature extraction mechanism. The approach aims to address the limitations of existing metrics by computing scores from multimodal inputs and by leveraging embeddings trained through large-scale contrastive learning.
Polos is a novel automatic evaluation metric for image captioning models that outperforms existing metrics by leveraging multimodal inputs and human feedback.