Enhancing Zero-Shot Vision-Language Reasoning with Image-Conditioned Text Correction
We introduce a novel pre-training task, Image-Conditioned Caption Correction (ICCC), that improves the zero-shot generalization of vision-language models without requiring labeled data from downstream tasks.