Core Concepts
The Latent Prompt Assist (LaPA) model leverages latent prompts to filter and extract clinically relevant information from uni-modal and multi-modal features, enabling improved performance in medical visual question answering tasks.
Abstract
The paper presents the Latent Prompt Assist (LaPA) model for medical visual question answering (Med-VQA). The key components of the LaPA model are:
- Latent Prompt Generation Module:
  - Generates a latent prompt constrained by the target answer tokens, focusing the model on answer-relevant information.
  - The latent prompt interacts with the full set of answer tokens to extract clinically relevant information.
- Multi-Modal Fusion Block:
  - Uses the latent prompt to filter uni-modal (image, language) and multi-modal information and extract clinically relevant details.
  - The latent prompt is fused with uni-modal and multi-modal features through a sequential cross-attention mechanism.
- Prior Knowledge Fusion Module:
  - Incorporates prior knowledge about organ-disease relationships from a knowledge graph.
  - The prior knowledge is combined with the latent-prompt-integrated information to refine the final answer prediction.
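The first two modules can be illustrated with a minimal NumPy sketch. All token counts, dimensions, and the single-head attention without learned projections are illustrative assumptions, not details from the paper; the real model uses trained projection weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product cross-attention (no learned projections, for illustration)."""
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))  # (n_query, n_context)
    return weights @ context                            # (n_query, d)

rng = np.random.default_rng(0)
dim = 64
latent_prompt = rng.standard_normal((8, dim))    # hypothetical: 8 learnable latent tokens
answer_tokens = rng.standard_normal((100, dim))  # embeddings of all candidate answer tokens
image_feats   = rng.standard_normal((49, dim))   # e.g. 7x7 grid of visual patch features
text_feats    = rng.standard_normal((20, dim))   # question token features
mm_feats      = rng.standard_normal((69, dim))   # fused multi-modal features

# 1) Constrain the latent prompt by attending over the full answer-token set.
prompt = cross_attention(latent_prompt, answer_tokens)

# 2) Sequentially filter each modality, with the prompt as the query at every step.
for feats in (image_feats, text_feats, mm_feats):
    prompt = cross_attention(prompt, feats)

integrated = prompt  # clinically relevant integrated information, shape (8, dim)
```

The key design point the sketch captures is that the latent prompt always sits on the query side, so only answer-relevant information is pulled out of each feature set.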
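The Prior Knowledge Fusion Module can be sketched similarly. The toy organ-disease graph, the concatenate-then-project fusion, and the random stand-in for a learned projection are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64

# Toy organ-disease knowledge graph (hypothetical entries, not from the paper).
kg_edges = {("lung", "pneumonia"): 0, ("liver", "cirrhosis"): 1, ("heart", "cardiomegaly"): 2}
relation_emb = rng.standard_normal((len(kg_edges), dim))  # one embedding per relation

# Stand-in for a learned fusion projection (random here, trained in the real model).
proj = rng.standard_normal((2 * dim, dim)) / np.sqrt(2 * dim)

def fuse_prior_knowledge(integrated, organ, disease):
    """Concatenate the matching organ-disease relation embedding with the
    integrated latent-prompt features, then project back to model dimension."""
    rel = relation_emb[kg_edges[(organ, disease)]]       # (dim,)
    rel = np.broadcast_to(rel, integrated.shape)         # align with each latent token
    concat = np.concatenate([integrated, rel], axis=-1)  # (n_tokens, 2*dim)
    return concat @ proj                                 # (n_tokens, dim)

integrated = rng.standard_normal((8, dim))  # stand-in for the latent-prompt output
final_feats = fuse_prior_knowledge(integrated, "lung", "pneumonia")
```

The fused features would then feed the answer classifier, letting graph-derived organ-disease relations bias the final prediction.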
The experimental results on three publicly available Med-VQA datasets (VQA-RAD, SLAKE, and VQA-2019) demonstrate that the LaPA model outperforms state-of-the-art approaches, achieving improvements of 1.83%, 0.63%, and 1.80% in overall accuracy, respectively. The ablation study further highlights the contributions of the individual components of the LaPA model, showcasing the effectiveness of the latent prompt mechanism and the integration of prior knowledge.
Stats
Beyond the overall accuracy improvements reported above (1.83% on VQA-RAD, 0.63% on SLAKE, and 1.80% on VQA-2019), no additional numerical metrics are captured in this summary.
Quotes
No direct quotes from the paper were identified as particularly striking or supportive of the key arguments.