The paper presents the RankVQA model, which aims to improve performance on Visual Question Answering (VQA) tasks. The key highlights are:
Visual Feature Extraction: The model uses the Faster R-CNN architecture to extract high-quality visual features from images, capturing detailed object-level information.
Text Feature Extraction: The model utilizes the pre-trained BERT model to extract rich semantic features from the question text, enabling better understanding of the natural language input.
Multimodal Fusion: The model employs a multi-head self-attention mechanism to dynamically integrate the visual and textual features, allowing it to capture complex interactions between the two modalities (see the sketch after this list).
Ranking Learning Module: The model incorporates a ranking learning module that optimizes the relative ranking of answer candidates, ensuring the correct answers are ranked higher than incorrect ones.
Hybrid Training Strategy: The model is trained using a combination of classification loss and ranking loss, enhancing its generalization ability and robustness across diverse datasets.
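Taken together, the fusion and training items above describe attention-based fusion followed by a joint classification-and-ranking objective. Below is a minimal PyTorch sketch of that general recipe, assuming pre-extracted Faster R-CNN region features and BERT token embeddings as inputs; the class and function names (FusionWithRanking, hybrid_loss), the feature dimensions, the answer-vocabulary size, the mean pooling, and the margin/weighting values are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of attention-based fusion plus a hybrid classification/ranking loss.
# All names, dimensions, and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionWithRanking(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, hidden=512, num_heads=8, num_answers=3129):
        super().__init__()
        # Project Faster R-CNN region features and BERT token features into a shared space.
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Multi-head attention: question tokens attend to image regions.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, region_feats, question_feats):
        v = self.vis_proj(region_feats)       # (B, num_regions, hidden)
        q = self.txt_proj(question_feats)     # (B, num_tokens, hidden)
        fused, _ = self.cross_attn(query=q, key=v, value=v)  # (B, num_tokens, hidden)
        pooled = fused.mean(dim=1)            # simple pooling over question tokens
        return self.classifier(pooled)        # answer-candidate scores (B, num_answers)

def hybrid_loss(scores, target, margin=0.2, alpha=0.5):
    """Classification loss plus a pairwise margin ranking loss that pushes the
    correct answer's score above the highest-scoring incorrect answer."""
    cls_loss = F.cross_entropy(scores, target)
    pos = scores.gather(1, target.unsqueeze(1)).squeeze(1)           # score of correct answer
    masked = scores.scatter(1, target.unsqueeze(1), float('-inf'))   # mask out correct answer
    hard_neg = masked.max(dim=1).values                              # hardest incorrect answer
    rank_loss = F.relu(margin - (pos - hard_neg)).mean()
    return cls_loss + alpha * rank_loss

# Illustrative usage with random tensors standing in for real features.
model = FusionWithRanking()
region_feats = torch.randn(4, 36, 2048)    # e.g. 36 Faster R-CNN regions per image
question_feats = torch.randn(4, 20, 768)   # e.g. 20 BERT token embeddings per question
target = torch.randint(0, 3129, (4,))
loss = hybrid_loss(model(region_feats, question_feats), target)
```

The hinge term mirrors the idea of ranking correct answers above incorrect candidates, while the cross-entropy term keeps standard classification training; their combination reflects the hybrid strategy described above.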
Experimental results on the VQA v2.0 and COCO-QA datasets demonstrate that the RankVQA model significantly outperforms existing state-of-the-art VQA models, achieving accuracies of 71.5% and 72.3%, respectively, and Mean Reciprocal Rank (MRR) scores of 0.75 and 0.76. The ablation study further highlights the contribution of each key component to the model's superior performance.
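For reference, Mean Reciprocal Rank averages the reciprocal of the rank at which the correct answer appears in each question's scored candidate list. A small sketch of that computation, assuming per-question score vectors and ground-truth answer indices; the function name and tensor shapes are assumptions, not the paper's evaluation code.

```python
import torch

def mean_reciprocal_rank(scores, targets):
    """scores: (N, num_answers) candidate scores; targets: (N,) index of the correct answer.
    MRR = mean over questions of 1 / rank(correct answer)."""
    order = scores.argsort(dim=1, descending=True)                      # candidates sorted by score
    ranks = (order == targets.unsqueeze(1)).float().argmax(dim=1) + 1   # 1-based rank of correct answer
    return (1.0 / ranks.float()).mean().item()
```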
Key insights distilled from the source content by Peiyuan Chen... at arxiv.org, 09-24-2024
https://arxiv.org/pdf/2408.07303.pdf