A retrieval-augmented image captioning method that prompts large language models with object names retrieved from an external visual-name memory, enabling open-world comprehension.
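As a rough illustration of the retrieve-then-prompt idea, the sketch below ranks a toy visual-name memory by cosine similarity to an image embedding and assembles an LLM prompt from the top names. The function names, the prompt template, and the random vectors (standing in for a CLIP-style encoder) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def retrieve_object_names(image_emb, name_embs, names, k=5):
    """Return the k object names whose memory embeddings are most
    similar to the image embedding (cosine similarity)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    name_embs = name_embs / np.linalg.norm(name_embs, axis=1, keepdims=True)
    sims = name_embs @ image_emb
    return [names[i] for i in np.argsort(-sims)[:k]]

def build_caption_prompt(object_names):
    """Compose an LLM prompt from the retrieved object names
    (hypothetical template, not the paper's)."""
    return ("Objects likely present in the image: "
            + ", ".join(object_names)
            + ". Write one sentence describing the image.")

# Toy usage: random vectors stand in for CLIP-style embeddings.
rng = np.random.default_rng(0)
names = ["dog", "frisbee", "park", "car", "tree", "bench"]
name_embs = rng.normal(size=(len(names), 512))
image_emb = rng.normal(size=512)
print(build_caption_prompt(retrieve_object_names(image_emb, name_embs, names, k=3)))
```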
An unsupervised method that improves image captioning models with reinforcement learning, using vision-language models as reward models to produce more detailed and comprehensive image descriptions.
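A minimal REINFORCE-style sketch of that idea, assuming the vision-language model exposes a scalar image-caption agreement score used as the reward. Here a fixed random lookup table stands in for the frozen VLM scorer and a linear layer stands in for the caption decoder; both are placeholders, not the paper's models.

```python
import torch

torch.manual_seed(0)
vocab, seq_len, dim, batch = 100, 8, 32, 4

policy = torch.nn.Linear(dim, vocab)       # toy stand-in for a caption decoder
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
reward_table = torch.randn(vocab)          # stand-in for a frozen VLM scorer

def vlm_reward(caption_ids):
    """Sequence-level reward; in the real setting a frozen
    vision-language model would score image-caption agreement."""
    return reward_table[caption_ids].mean(dim=-1)

hidden = torch.randn(batch, seq_len, dim)  # decoder states for a batch
dist = torch.distributions.Categorical(logits=policy(hidden))
sampled = dist.sample()                    # sample a caption per image
reward = vlm_reward(sampled)
baseline = reward.mean()                   # simple batch baseline
loss = -((reward - baseline) * dist.log_prob(sampled).sum(-1)).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```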
This paper proposes TIPCap, a text data-centric image captioning approach with interactive prompts that provides a unified solution for different data configurations, effectively mitigates the modality gap, and can incorporate optional prompt information to generate higher-quality descriptions.
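One common way text-centric methods bridge the modality gap is noise injection: perturb a text embedding during text-only training so it mimics the paired image embedding. The sketch below shows that generic trick (known from text-only captioning work such as CapDec); it is not necessarily TIPCap's exact mapping, and `sigma` is an illustrative hyperparameter.

```python
import torch

def text_to_pseudo_image(text_emb, sigma=0.1):
    """Perturb a text embedding with Gaussian noise so it can stand in
    for the paired image embedding during text-only training."""
    noisy = text_emb + sigma * torch.randn_like(text_emb)
    return noisy / noisy.norm(dim=-1, keepdim=True)

# At training time the decoder conditions on pseudo-image embeddings;
# at inference it receives real image embeddings from the same encoder.
text_emb = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
pseudo_image = text_to_pseudo_image(text_emb)
```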
A semi-supervised image captioning method that uses Wasserstein Graph Matching to efficiently exploit undescribed images.
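A minimal sketch of the Wasserstein-matching ingredient: entropic-regularized optimal transport (standard Sinkhorn iterations) over a cosine-distance cost matrix between image-graph and sentence-graph node features. How the graphs are built and how the matching cost supervises undescribed images are the paper's contributions and are not shown here.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized optimal transport: returns a soft matching
    (transport plan) between rows and columns of the cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy usage: match 4 image-graph nodes to 5 sentence-graph nodes.
rng = np.random.default_rng(0)
img_nodes = rng.normal(size=(4, 16))
txt_nodes = rng.normal(size=(5, 16))
cost = 1 - (img_nodes @ txt_nodes.T) / (
    np.linalg.norm(img_nodes, axis=1)[:, None]
    * np.linalg.norm(txt_nodes, axis=1)[None, :])
plan = sinkhorn(cost, np.full(4, 0.25), np.full(5, 0.2))
print(plan.round(3))            # soft node correspondences
print((plan * cost).sum())      # approximate matching cost
```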
DECap proposes a novel diffusion-based method for explicit caption editing, demonstrating strong generalization and the potential to improve caption generation quality.
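For intuition, explicit caption editing predicts per-token edit operations rather than regenerating the caption from scratch; the sketch below only shows the mechanics of applying one edit round. The operation set shown (KEEP / DELETE / REPLACE) is a simplification, and DECap's actual operation vocabulary and diffusion-style iterative refinement are not reproduced here.

```python
def apply_edit_ops(tokens, ops, replacements):
    """Apply one round of explicit per-token edits; an editing model
    would predict `ops` and `replacements` at each refinement step."""
    edited = []
    for tok, op, rep in zip(tokens, ops, replacements):
        if op == "KEEP":
            edited.append(tok)
        elif op == "REPLACE":
            edited.append(rep)
        # "DELETE" contributes nothing to the edited caption
    return edited

caption = ["a", "cat", "sitting", "on", "a", "car"]
ops = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "REPLACE"]
reps = [None, None, None, None, None, "couch"]
print(apply_edit_ops(caption, ops, reps))  # ['a', 'cat', 'sitting', 'on', 'a', 'couch']
```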
Polos is a novel automatic evaluation metric for image captioning models that outperforms existing metrics by leveraging multimodal inputs and human feedback.
MeaCap proposes a novel Memory-Augmented framework for zero-shot image captioning that achieves state-of-the-art performance by integrating textual memory with visual-related fusion scores.
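A rough sketch of the memory-retrieval step and a score fusing image relevance with language-model fluency. The similarity space, the `fusion_score` form, and `alpha` are illustrative assumptions; MeaCap's actual concept extraction and fusion scoring differ.

```python
import numpy as np

def retrieve_memory_captions(image_emb, caption_embs, captions, k=3):
    """Return the k textual-memory captions closest to the image in a
    CLIP-style joint embedding space (cosine similarity)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    top = np.argsort(-(caption_embs @ image_emb))[:k]
    return [captions[i] for i in top]

def fusion_score(image_text_sim, lm_logprob, alpha=0.7):
    """Rank candidates by fusing image relevance with language-model
    fluency; alpha is an illustrative weight, not the paper's."""
    return alpha * image_text_sim + (1 - alpha) * lm_logprob

# Toy usage with random embeddings standing in for a real encoder.
rng = np.random.default_rng(0)
memory = ["a dog catches a frisbee", "a red car on a street", "kids play in a park"]
caption_embs = rng.normal(size=(3, 512))
image_emb = rng.normal(size=512)
print(retrieve_memory_captions(image_emb, caption_embs, memory, k=2))
```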
The authors propose Polos, a supervised automatic evaluation metric for image captioning models that uses a parallel feature extraction mechanism and human feedback, addressing the limitations of existing metrics by incorporating multimodal inputs and large-scale contrastive learning.
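Loosely in the spirit of learned evaluation metrics, the sketch below extracts image, candidate, and reference features in parallel (a frozen encoder is assumed), combines them, and regresses a human rating. The feature combination and the small MLP head are assumptions for illustration; Polos's actual architecture and contrastive pretraining are not reproduced.

```python
import torch
import torch.nn.functional as F

class LearnedCaptionMetric(torch.nn.Module):
    """Sketch of a supervised captioning metric: combine parallel
    image / candidate / reference features and regress a score."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = torch.nn.Sequential(
            torch.nn.Linear(dim * 4, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 1), torch.nn.Sigmoid())

    def forward(self, img, cand, ref):
        # Element-wise products and absolute differences are a common
        # feature-combination choice in learned metrics (an assumption here).
        feats = torch.cat([img * cand, ref * cand,
                           (img - cand).abs(), (ref - cand).abs()], dim=-1)
        return self.head(feats).squeeze(-1)

# Training signal: mean-squared error against human ratings in [0, 1].
metric = LearnedCaptionMetric()
img, cand, ref = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
human = torch.rand(8)
loss = F.mse_loss(metric(img, cand, ref), human)
```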