ALOHa leverages large language models to reliably detect and localize object hallucinations in image captions, outperforming prior methods.
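To make the idea concrete, here is a rough sketch of LLM-assisted hallucination detection: an LLM lists the objects mentioned in a candidate caption, each object is matched to its most semantically similar reference object, and objects with no close match are flagged, which also localizes the error to a specific phrase. The embedding model, the threshold, and the stubbed `extract_objects_with_llm` helper are illustrative assumptions, not ALOHa's exact pipeline.

```python
# Illustrative sketch (not ALOHa's exact pipeline): flag caption objects that
# have no close semantic match among reference objects.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def extract_objects_with_llm(caption: str) -> list[str]:
    """Placeholder for the LLM step: prompt an LLM of choice to list the
    concrete, groundable objects mentioned in the caption. Stubbed for the demo."""
    return ["dog", "frisbee", "park bench"]

def hallucination_report(caption: str, reference_objects: list[str], threshold: float = 0.5):
    """Match each caption object to its most similar reference object.
    Objects whose best match falls below `threshold` are flagged as likely
    hallucinations, localizing the error to a specific caption phrase."""
    candidate_objects = extract_objects_with_llm(caption)
    cand_emb = encoder.encode(candidate_objects, convert_to_tensor=True)
    ref_emb = encoder.encode(reference_objects, convert_to_tensor=True)
    sims = util.cos_sim(cand_emb, ref_emb)      # (num_candidates, num_references)
    best = sims.max(dim=1).values                # best match per caption object
    return [
        {"object": obj, "score": float(s), "hallucinated": float(s) < threshold}
        for obj, s in zip(candidate_objects, best)
    ]

print(hallucination_report(
    "A dog catches a frisbee near a park bench.",
    reference_objects=["dog", "frisbee", "grass"],
))
```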
The authors propose LinCIR, a novel language-only training framework that efficiently learns a projection module to enable zero-shot composed image retrieval without relying on expensive image-text-image triplet datasets.
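As a loose illustration of the projection idea, the sketch below maps an embedding from the joint image-text space into the text encoder's token-embedding space, so that at retrieval time a reference image can be inserted into the modification text as a pseudo word; the dimensions, module architecture, and composition template are assumptions rather than LinCIR's actual implementation.

```python
# Minimal sketch of projection-based zero-shot composed image retrieval
# (dimensions and names are assumed, not LinCIR's exact design).
import torch
import torch.nn as nn

EMBED_DIM = 512   # CLIP joint-embedding dimension (assumed)
TOKEN_DIM = 512   # CLIP text-encoder token-embedding dimension (assumed)

# phi maps an embedding from the joint space into the token-embedding space,
# producing a "pseudo word" the frozen text encoder can consume.
phi = nn.Sequential(
    nn.Linear(EMBED_DIM, TOKEN_DIM),
    nn.GELU(),
    nn.Linear(TOKEN_DIM, TOKEN_DIM),
)

def compose_query(reference_image_emb: torch.Tensor,
                  modification_text_tokens: torch.Tensor,
                  token_embedding: nn.Embedding,
                  placeholder_idx: int) -> torch.Tensor:
    """Inference-time composition: replace a placeholder token in the
    modification text with the pseudo-token projected from the reference
    image embedding; the resulting sequence goes to the frozen text encoder,
    whose output embedding retrieves the target image."""
    tok_emb = token_embedding(modification_text_tokens).clone()  # (seq_len, TOKEN_DIM)
    tok_emb[placeholder_idx] = phi(reference_image_emb)          # insert pseudo word
    return tok_emb                                               # feed to text encoder

# Toy demo with random tensors standing in for a frozen CLIP.
# Training uses captions only: the caption's own embedding is projected through
# phi, so no (reference image, text, target image) triplets are needed.
vocab = nn.Embedding(1000, TOKEN_DIM)
tokens = torch.randint(0, 1000, (8,))
query = compose_query(torch.randn(EMBED_DIM), tokens, vocab, placeholder_idx=3)
print(query.shape)  # torch.Size([8, 512])
```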
The authors propose SemMIM, a framework that enhances cross-modal semantic alignment by injecting high-level semantics into local patch encodings and involving text deeply in the masked image modeling (MIM) process.
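A hedged sketch of what such an objective could look like: masked patch encodings attend to text tokens and are trained to predict high-level semantic features (e.g. from a teacher encoder) rather than raw pixels; every module and loss choice below is an illustrative assumption, not SemMIM's published architecture.

```python
# Rough sketch of a semantics-aware, text-involved MIM objective
# (module names and the teacher-feature targets are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticMIMHead(nn.Module):
    """Predict high-level semantic features for masked patches, conditioned on
    text: cross-attention lets text tokens participate in the reconstruction,
    steering the objective toward cross-modal alignment instead of pixels."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.predictor = nn.Linear(dim, dim)

    def forward(self, patch_tokens, text_tokens):
        # patch_tokens: (B, N, D) visible + masked patch encodings
        # text_tokens:  (B, T, D) text encodings
        attended, _ = self.cross_attn(patch_tokens, text_tokens, text_tokens)
        return self.predictor(attended)

def semantic_mim_loss(pred, semantic_targets, mask):
    """Cosine-style loss between predictions and high-level semantic targets
    (e.g. features from a momentum/teacher encoder), on masked patches only."""
    pred = F.normalize(pred, dim=-1)
    tgt = F.normalize(semantic_targets, dim=-1)
    per_patch = 1 - (pred * tgt).sum(-1)              # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# Toy shapes for a quick check.
B, N, T, D = 2, 196, 12, 768
head = SemanticMIMHead(D)
pred = head(torch.randn(B, N, D), torch.randn(B, T, D))
mask = (torch.rand(B, N) < 0.4).float()               # 1 where a patch was masked
print(semantic_mim_loss(pred, torch.randn(B, N, D), mask))
```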
The authors introduce the All-Seeing Project V2 to improve relation comprehension in vision-language models through a novel task called Relation Conversation (ReC).
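For intuition only, the snippet below shows the kind of relation-grounded conversational sample such a task targets, where the entities and the predicate linking them are tied to image regions; the field names and tag syntax are invented for illustration and do not reproduce the project's actual annotation format.

```python
# Purely illustrative example of a relation-grounded conversation sample:
# the answer names two entities and their relation, and each mention is
# linked to a bounding box. Field names and tags are hypothetical.
rec_sample = {
    "image": "example.jpg",
    "question": "What is the person doing with the bicycle?",
    "answer": "The <obj>person</obj>[box_1] is <pred>riding</pred> "
              "the <obj>bicycle</obj>[box_2].",
    "boxes": {
        "box_1": [120, 45, 260, 380],   # x1, y1, x2, y2 in pixels
        "box_2": [100, 200, 300, 420],
    },
}
print(rec_sample["answer"])
```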