ALOHa leverages large language models to reliably detect and localize object hallucinations in image captions, outperforming prior methods.
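To make the idea concrete, here is a rough sketch of LLM-assisted hallucination detection: an LLM lists the objects mentioned in a candidate caption, each object is matched to its most semantically similar reference object, and objects with no close match are flagged, which also localizes the error to a specific phrase. The embedding model, the threshold, and the stubbed `extract_objects_with_llm` helper are illustrative assumptions, not ALOHa's exact pipeline.

```python
# Illustrative sketch (not ALOHa's exact pipeline): flag caption objects that
# have no close semantic match among reference objects.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def extract_objects_with_llm(caption: str) -> list[str]:
    """Placeholder for the LLM step: prompt an LLM of choice to list the
    concrete, groundable objects mentioned in the caption. Stubbed for the demo."""
    return ["dog", "frisbee", "park bench"]

def hallucination_report(caption: str, reference_objects: list[str], threshold: float = 0.5):
    """Match each caption object to its most similar reference object.
    Objects whose best match falls below `threshold` are flagged as likely
    hallucinations, localizing the error to a specific caption phrase."""
    candidate_objects = extract_objects_with_llm(caption)
    cand_emb = encoder.encode(candidate_objects, convert_to_tensor=True)
    ref_emb = encoder.encode(reference_objects, convert_to_tensor=True)
    sims = util.cos_sim(cand_emb, ref_emb)      # (num_candidates, num_references)
    best = sims.max(dim=1).values                # best match per caption object
    return [
        {"object": obj, "score": float(s), "hallucinated": float(s) < threshold}
        for obj, s in zip(candidate_objects, best)
    ]

print(hallucination_report(
    "A dog catches a frisbee near a park bench.",
    reference_objects=["dog", "frisbee", "grass"],
))
```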
The authors propose LinCIR, a novel language-only training framework that efficiently learns a projection module to enable zero-shot composed image retrieval without relying on expensive image-text-image triplet datasets.
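As a loose illustration of the projection idea, the sketch below maps an embedding from the joint image-text space into the text encoder's token-embedding space, so that at retrieval time a reference image can be inserted into the modification text as a pseudo word; the dimensions, module architecture, and composition template are assumptions rather than LinCIR's actual implementation.

```python
# Minimal sketch of projection-based zero-shot composed image retrieval
# (dimensions and names are assumed, not LinCIR's exact design).
import torch
import torch.nn as nn

EMBED_DIM = 512   # CLIP joint-embedding dimension (assumed)
TOKEN_DIM = 512   # CLIP text-encoder token-embedding dimension (assumed)

# phi maps an embedding from the joint space into the token-embedding space,
# producing a "pseudo word" the frozen text encoder can consume.
phi = nn.Sequential(
    nn.Linear(EMBED_DIM, TOKEN_DIM),
    nn.GELU(),
    nn.Linear(TOKEN_DIM, TOKEN_DIM),
)

def compose_query(reference_image_emb: torch.Tensor,
                  modification_text_tokens: torch.Tensor,
                  token_embedding: nn.Embedding,
                  placeholder_idx: int) -> torch.Tensor:
    """Inference-time composition: replace a placeholder token in the
    modification text with the pseudo-token projected from the reference
    image embedding; the resulting sequence goes to the frozen text encoder,
    whose output embedding retrieves the target image."""
    tok_emb = token_embedding(modification_text_tokens).clone()  # (seq_len, TOKEN_DIM)
    tok_emb[placeholder_idx] = phi(reference_image_emb)          # insert pseudo word
    return tok_emb                                               # feed to text encoder

# Toy demo with random tensors standing in for a frozen CLIP.
# Training uses captions only: the caption's own embedding is projected through
# phi, so no (reference image, text, target image) triplets are needed.
vocab = nn.Embedding(1000, TOKEN_DIM)
tokens = torch.randint(0, 1000, (8,))
query = compose_query(torch.randn(EMBED_DIM), tokens, vocab, placeholder_idx=3)
print(query.shape)  # torch.Size([8, 512])
```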
The authors propose SemMIM, a framework that enhances cross-modal semantic alignment by injecting high-level semantics into local patch encodings and involving text deeply in the masked image modeling (MIM) process.
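A hedged sketch of what such an objective could look like: masked patch encodings attend to text tokens and are trained to predict high-level semantic features (e.g. from a teacher encoder) rather than raw pixels; every module and loss choice below is an illustrative assumption, not SemMIM's published architecture.

```python
# Rough sketch of a semantics-aware, text-involved MIM objective
# (module names and the teacher-feature targets are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticMIMHead(nn.Module):
    """Predict high-level semantic features for masked patches, conditioned on
    text: cross-attention lets text tokens participate in the reconstruction,
    steering the objective toward cross-modal alignment instead of pixels."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.predictor = nn.Linear(dim, dim)

    def forward(self, patch_tokens, text_tokens):
        # patch_tokens: (B, N, D) visible + masked patch encodings
        # text_tokens:  (B, T, D) text encodings
        attended, _ = self.cross_attn(patch_tokens, text_tokens, text_tokens)
        return self.predictor(attended)

def semantic_mim_loss(pred, semantic_targets, mask):
    """Cosine-style loss between predictions and high-level semantic targets
    (e.g. features from a momentum/teacher encoder), on masked patches only."""
    pred = F.normalize(pred, dim=-1)
    tgt = F.normalize(semantic_targets, dim=-1)
    per_patch = 1 - (pred * tgt).sum(-1)              # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# Toy shapes for a quick check.
B, N, T, D = 2, 196, 12, 768
head = SemanticMIMHead(D)
pred = head(torch.randn(B, N, D), torch.randn(B, T, D))
mask = (torch.rand(B, N) < 0.4).float()               # 1 where a patch was masked
print(semantic_mim_loss(pred, torch.randn(B, N, D), mask))
```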
The authors introduce the All-Seeing Project V2 to improve relation comprehension in vision-language models through a novel task called Relation Conversation (ReC).
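For intuition only, the snippet below shows the kind of relation-grounded conversational sample such a task targets, where the entities and the predicate linking them are tied to image regions; the field names and tag syntax are invented for illustration and do not reproduce the project's actual annotation format.

```python
# Purely illustrative example of a relation-grounded conversation sample:
# the answer names two entities and their relation, and each mention is
# linked to a bounding box. Field names and tags are hypothetical.
rec_sample = {
    "image": "example.jpg",
    "question": "What is the person doing with the bicycle?",
    "answer": "The <obj>person</obj>[box_1] is <pred>riding</pred> "
              "the <obj>bicycle</obj>[box_2].",
    "boxes": {
        "box_1": [120, 45, 260, 380],   # x1, y1, x2, y2 in pixels
        "box_2": [100, 200, 300, 420],
    },
}
print(rec_sample["answer"])
```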