A retrieval-augmented image captioning method that prompts large language models with object names retrieved from an external visual-name memory, enabling open-world comprehension.
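As a rough illustration of the retrieve-then-prompt idea, the sketch below ranks a toy visual-name memory by cosine similarity to an image embedding and assembles an LLM prompt from the top names. The function names, the prompt template, and the random vectors (standing in for a CLIP-style encoder) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def retrieve_object_names(image_emb, name_embs, names, k=5):
    """Return the k object names whose memory embeddings are most
    similar to the image embedding (cosine similarity)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    name_embs = name_embs / np.linalg.norm(name_embs, axis=1, keepdims=True)
    sims = name_embs @ image_emb
    return [names[i] for i in np.argsort(-sims)[:k]]

def build_caption_prompt(object_names):
    """Compose an LLM prompt from the retrieved object names
    (hypothetical template, not the paper's)."""
    return ("Objects likely present in the image: "
            + ", ".join(object_names)
            + ". Write one sentence describing the image.")

# Toy usage: random vectors stand in for CLIP-style embeddings.
rng = np.random.default_rng(0)
names = ["dog", "frisbee", "park", "car", "tree", "bench"]
name_embs = rng.normal(size=(len(names), 512))
image_emb = rng.normal(size=512)
print(build_caption_prompt(retrieve_object_names(image_emb, name_embs, names, k=3)))
```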
An unsupervised method that improves image captioning models with reinforcement learning, using vision-language models as reward models to produce more detailed and comprehensive image descriptions.
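A minimal REINFORCE-style sketch of that idea, assuming the vision-language model exposes a scalar image-caption agreement score used as the reward. Here a fixed random lookup table stands in for the frozen VLM scorer and a linear layer stands in for the caption decoder; both are placeholders, not the paper's models.

```python
import torch

torch.manual_seed(0)
vocab, seq_len, dim, batch = 100, 8, 32, 4

policy = torch.nn.Linear(dim, vocab)       # toy stand-in for a caption decoder
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
reward_table = torch.randn(vocab)          # stand-in for a frozen VLM scorer

def vlm_reward(caption_ids):
    """Sequence-level reward; in the real setting a frozen
    vision-language model would score image-caption agreement."""
    return reward_table[caption_ids].mean(dim=-1)

hidden = torch.randn(batch, seq_len, dim)  # decoder states for a batch
dist = torch.distributions.Categorical(logits=policy(hidden))
sampled = dist.sample()                    # sample a caption per image
reward = vlm_reward(sampled)
baseline = reward.mean()                   # simple batch baseline
loss = -((reward - baseline) * dist.log_prob(sampled).sum(-1)).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```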
This paper proposes TIPCap, a text data-centric image captioning approach with interactive prompts that provides a unified solution for different data configurations, effectively mitigates the modality gap, and can incorporate optional prompt information to generate higher-quality descriptions.
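One common way text-centric methods bridge the modality gap is noise injection: perturb a text embedding during text-only training so it mimics the paired image embedding. The sketch below shows that generic trick (known from text-only captioning work such as CapDec); it is not necessarily TIPCap's exact mapping, and `sigma` is an illustrative hyperparameter.

```python
import torch

def text_to_pseudo_image(text_emb, sigma=0.1):
    """Perturb a text embedding with Gaussian noise so it can stand in
    for the paired image embedding during text-only training."""
    noisy = text_emb + sigma * torch.randn_like(text_emb)
    return noisy / noisy.norm(dim=-1, keepdim=True)

# At training time the decoder conditions on pseudo-image embeddings;
# at inference it receives real image embeddings from the same encoder.
text_emb = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
pseudo_image = text_to_pseudo_image(text_emb)
```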
A semi-supervised image captioning method that uses Wasserstein Graph Matching to efficiently exploit undescribed images.
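A minimal sketch of the Wasserstein-matching ingredient: entropic-regularized optimal transport (standard Sinkhorn iterations) over a cosine-distance cost matrix between image-graph and sentence-graph node features. How the graphs are built and how the matching cost supervises undescribed images are the paper's contributions and are not shown here.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized optimal transport: returns a soft matching
    (transport plan) between rows and columns of the cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy usage: match 4 image-graph nodes to 5 sentence-graph nodes.
rng = np.random.default_rng(0)
img_nodes = rng.normal(size=(4, 16))
txt_nodes = rng.normal(size=(5, 16))
cost = 1 - (img_nodes @ txt_nodes.T) / (
    np.linalg.norm(img_nodes, axis=1)[:, None]
    * np.linalg.norm(txt_nodes, axis=1)[None, :])
plan = sinkhorn(cost, np.full(4, 0.25), np.full(5, 0.2))
print(plan.round(3))            # soft node correspondences
print((plan * cost).sum())      # approximate matching cost
```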
DECap proposes a novel diffusion-based method for explicit caption editing, demonstrating strong generalization and the potential to improve caption generation quality.
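For intuition, explicit caption editing predicts per-token edit operations rather than regenerating the caption from scratch; the sketch below only shows the mechanics of applying one edit round. The operation set shown (KEEP / DELETE / REPLACE) is a simplification, and DECap's actual operation vocabulary and diffusion-style iterative refinement are not reproduced here.

```python
def apply_edit_ops(tokens, ops, replacements):
    """Apply one round of explicit per-token edits; an editing model
    would predict `ops` and `replacements` at each refinement step."""
    edited = []
    for tok, op, rep in zip(tokens, ops, replacements):
        if op == "KEEP":
            edited.append(tok)
        elif op == "REPLACE":
            edited.append(rep)
        # "DELETE" contributes nothing to the edited caption
    return edited

caption = ["a", "cat", "sitting", "on", "a", "car"]
ops = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "REPLACE"]
reps = [None, None, None, None, None, "couch"]
print(apply_edit_ops(caption, ops, reps))  # ['a', 'cat', 'sitting', 'on', 'a', 'couch']
```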
Polos is a novel automatic evaluation metric for image captioning models that outperforms existing metrics by leveraging multimodal inputs and human feedback.
MeaCap proposes a novel Memory-Augmented framework for zero-shot image captioning that achieves state-of-the-art performance by integrating textual memory with visual-related fusion scores.
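A rough sketch of the memory-retrieval step and a score fusing image relevance with language-model fluency. The similarity space, the `fusion_score` form, and `alpha` are illustrative assumptions; MeaCap's actual concept extraction and fusion scoring differ.

```python
import numpy as np

def retrieve_memory_captions(image_emb, caption_embs, captions, k=3):
    """Return the k textual-memory captions closest to the image in a
    CLIP-style joint embedding space (cosine similarity)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    top = np.argsort(-(caption_embs @ image_emb))[:k]
    return [captions[i] for i in top]

def fusion_score(image_text_sim, lm_logprob, alpha=0.7):
    """Rank candidates by fusing image relevance with language-model
    fluency; alpha is an illustrative weight, not the paper's."""
    return alpha * image_text_sim + (1 - alpha) * lm_logprob

# Toy usage with random embeddings standing in for a real encoder.
rng = np.random.default_rng(0)
memory = ["a dog catches a frisbee", "a red car on a street", "kids play in a park"]
caption_embs = rng.normal(size=(3, 512))
image_emb = rng.normal(size=512)
print(retrieve_memory_captions(image_emb, caption_embs, memory, k=2))
```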
The authors propose Polos, a supervised automatic evaluation metric for image captioning models that uses a parallel feature extraction mechanism and human feedback, addressing the limitations of existing metrics by incorporating multimodal inputs and large-scale contrastive learning.
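Loosely in the spirit of learned evaluation metrics, the sketch below extracts image, candidate, and reference features in parallel (a frozen encoder is assumed), combines them, and regresses a human rating. The feature combination and the small MLP head are assumptions for illustration; Polos's actual architecture and contrastive pretraining are not reproduced.

```python
import torch
import torch.nn.functional as F

class LearnedCaptionMetric(torch.nn.Module):
    """Sketch of a supervised captioning metric: combine parallel
    image / candidate / reference features and regress a score."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = torch.nn.Sequential(
            torch.nn.Linear(dim * 4, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 1), torch.nn.Sigmoid())

    def forward(self, img, cand, ref):
        # Element-wise products and absolute differences are a common
        # feature-combination choice in learned metrics (an assumption here).
        feats = torch.cat([img * cand, ref * cand,
                           (img - cand).abs(), (ref - cand).abs()], dim=-1)
        return self.head(feats).squeeze(-1)

# Training signal: mean-squared error against human ratings in [0, 1].
metric = LearnedCaptionMetric()
img, cand, ref = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
human = torch.rand(8)
loss = F.mse_loss(metric(img, cand, ref), human)
```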