Utilizing Visual Question Answering to Guide Multimodal Pre-training for Improved Medical Insights
The authors propose a novel multimodal pre-training framework that uses visual question answering (VQA) to guide the model toward pathological features of interest without requiring additional expert annotations. The framework also includes a quasi-textual feature transformer module that narrows the vision-language gap and facilitates modality alignment.
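To make the design concrete, below is a minimal PyTorch sketch of one plausible realization: a Q-Former-style quasi-textual feature transformer whose learnable query tokens cross-attend to image patch features and emit tokens closer to the text embedding space, followed by a VQA objective that fuses those tokens with the question and classifies over a fixed answer vocabulary. All module names, dimensions, the fusion scheme, and the classification-style VQA head are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: a Q-Former-style "quasi-textual" bridge plus a VQA-guided
# pre-training head. Hyperparameters and structure are assumptions for
# illustration, not the authors' architecture.
import torch
import torch.nn as nn

class QuasiTextualFeatureTransformer(nn.Module):
    """Learnable queries cross-attend to visual patch features and emit
    'quasi-textual' tokens intended to live nearer the text space."""
    def __init__(self, vis_dim=768, txt_dim=512, num_queries=32,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, txt_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, txt_dim)  # map vision features to query width
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=txt_dim, nhead=num_heads,
                                       dim_feedforward=4 * txt_dim,
                                       batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, vis_feats):                 # vis_feats: (B, N_patches, vis_dim)
        mem = self.vis_proj(vis_feats)            # (B, N, txt_dim)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        for layer in self.layers:                 # queries self-attend, then cross-attend to vision
            q = layer(q, mem)
        return q                                  # (B, num_queries, txt_dim)


class VQAGuidedPretrainer(nn.Module):
    """VQA as a pre-training signal: fuse quasi-textual visual tokens with
    the question and classify over an answer vocabulary (an assumption;
    generative answer decoding would be an equally plausible choice)."""
    def __init__(self, txt_dim=512, vocab_size=30522, num_answers=1000):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, txt_dim)
        self.qt_former = QuasiTextualFeatureTransformer(txt_dim=txt_dim)
        fusion_layer = nn.TransformerEncoderLayer(d_model=txt_dim, nhead=8,
                                                  dim_feedforward=4 * txt_dim,
                                                  batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.answer_head = nn.Linear(txt_dim, num_answers)

    def forward(self, vis_feats, question_ids):
        qt_tokens = self.qt_former(vis_feats)          # (B, Q, D)
q_tokens = None
        q_tokens = self.txt_embed(question_ids)        # (B, L, D)
        fused = self.fusion(torch.cat([qt_tokens, q_tokens], dim=1))
        return self.answer_head(fused.mean(dim=1))     # (B, num_answers)


# Toy usage: random tensors stand in for a frozen vision backbone's
# patch features and a tokenized clinical question.
model = VQAGuidedPretrainer()
vis_feats = torch.randn(2, 196, 768)             # e.g. ViT patch features
question_ids = torch.randint(0, 30522, (2, 16))  # tokenized question
logits = model(vis_feats, question_ids)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 1000, (2,)))
loss.backward()
```

Because the supervision comes from question-answer pairs rather than region-level labels, a setup like this can steer attention toward pathology mentioned in the questions without any additional expert annotation, which is the core idea the framework exploits.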