Mitigating Open-Vocabulary Hallucinations in Image Captioning Models
Existing image captioning models suffer from hallucinations, generating spurious details that cannot be inferred from the input image. This work addresses hallucinations in the open-vocabulary setting through two contributions: a new benchmark (OpenCHAIR) for evaluating open-vocabulary object hallucinations, and an optimization-based approach (MOCHa) for mitigating such hallucinations without relying on a closed object list.
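To give a rough flavor of what "open-vocabulary" evaluation means here, the sketch below scores a caption by extracting its object nouns with spaCy and checking each against the image's annotated objects, rather than against a fixed closed category list (as in the original CHAIR metric). This is a minimal illustration, not the actual OpenCHAIR protocol: the function names are hypothetical, and the exact-match comparison glosses over synonym and hyponym handling (e.g., "puppy" vs. "dog") that a real open-vocabulary benchmark must address.

```python
# Minimal sketch of open-vocabulary hallucination scoring (hypothetical;
# not the OpenCHAIR implementation). Assumes spaCy with its small English
# model: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def caption_objects(caption: str) -> set[str]:
    """Extract lemmatized object nouns from a caption (open vocabulary)."""
    doc = nlp(caption)
    return {tok.lemma_.lower() for tok in doc if tok.pos_ == "NOUN"}

def hallucination_rate(caption: str, gt_objects: set[str]) -> float:
    """Fraction of mentioned objects absent from the ground-truth set.

    Unlike closed-list metrics, nothing restricts the candidate objects
    to a predefined vocabulary: any noun the captioner produces is
    checked against the image's annotated objects.
    """
    mentioned = caption_objects(caption)
    if not mentioned:
        return 0.0
    hallucinated = {obj for obj in mentioned if obj not in gt_objects}
    return len(hallucinated) / len(mentioned)

# Example: "leash" counts as hallucinated if the image only shows
# a dog on grass.
print(hallucination_rate(
    "a dog holding a leash on the grass",
    gt_objects={"dog", "grass"},
))  # -> 0.333...
```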