This survey paper provides a comprehensive overview of existing datasets and algorithms in the field of Visual Question Answering (VQA), categorizing and analyzing their strengths, weaknesses, and specific focuses.
This paper proposes a novel graph-based multimodal commonsense knowledge distillation framework to enhance Visual Question Answering (VQA) by integrating commonsense knowledge, visual features, and question representations into a unified graph structure processed by a Graph Convolutional Network (GCN).
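As a rough illustration of that unified-graph idea, the sketch below builds a small graph whose nodes stand in for commonsense facts, visual regions, and the question, and passes it through two hand-rolled GCN layers; the node counts, edge pattern, and dimensions are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # adj: row-normalized adjacency with self-loops, shape (N, N)
        return torch.relu(self.linear(adj @ h))

# Illustrative node layout: 5 commonsense-fact nodes, 10 visual-region nodes,
# and 1 question node, all assumed to be pre-projected into a shared 256-d space.
num_nodes, dim, answer_vocab = 16, 256, 3000
h = torch.randn(num_nodes, dim)                  # unified node features
adj = torch.eye(num_nodes)                       # start with self-loops
adj[-1, :-1] = 1.0                               # connect the question node to all others
adj[:-1, -1] = 1.0                               # (edge pattern is illustrative only)
adj = adj / adj.sum(dim=1, keepdim=True)         # row-normalize

layer1, layer2 = GCNLayer(dim, dim), GCNLayer(dim, dim)
h = layer2(layer1(h, adj), adj)                  # two rounds of message passing

answer_logits = nn.Linear(dim, answer_vocab)(h[-1])  # predict from the question node
```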
This paper introduces SimpsonsVQA, a novel dataset based on The Simpsons cartoon imagery, designed to advance Visual Question Answering (VQA) research beyond photorealistic images and address challenges in question relevance and answer correctness assessment, particularly for educational applications.
This paper introduces ChitroJera, a new large-scale, culturally relevant visual question answering (VQA) dataset for the Bangla language, addressing the lack of such resources and enabling the development of more effective VQA models for this under-resourced language.
EchoSight, a novel retrieval-augmented vision-language system, excels in knowledge-based visual question answering by employing a dual-stage search mechanism that integrates visual-only retrieval with multimodal reranking, significantly improving accuracy over existing vision-language models (VLMs).
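The retrieve-then-rerank pattern this summary describes can be sketched roughly as follows, with cosine similarity over precomputed embeddings standing in for the visual retriever and a simple fused query standing in for the multimodal reranker; the encoders, fusion rule, and knowledge-base sizes are illustrative assumptions, not EchoSight's actual components.

```python
import torch
import torch.nn.functional as F

def visual_retrieve(query_img_emb, kb_img_embs, top_k=20):
    """Stage 1: visual-only retrieval of candidate knowledge-base entries."""
    sims = F.cosine_similarity(query_img_emb.unsqueeze(0), kb_img_embs, dim=-1)
    return sims.topk(top_k).indices

def multimodal_rerank(question_emb, query_img_emb, candidate_text_embs):
    """Stage 2: rerank candidates with a joint image+question query (illustrative fusion)."""
    query = F.normalize(question_emb + query_img_emb, dim=-1)
    scores = candidate_text_embs @ query
    return scores.argsort(descending=True)

# Hypothetical precomputed embeddings (e.g., from CLIP-style encoders).
kb_img_embs = F.normalize(torch.randn(10_000, 512), dim=-1)   # knowledge-base article images
kb_text_embs = F.normalize(torch.randn(10_000, 512), dim=-1)  # knowledge-base article text
query_img_emb = F.normalize(torch.randn(512), dim=-1)
question_emb = F.normalize(torch.randn(512), dim=-1)

cand_idx = visual_retrieve(query_img_emb, kb_img_embs)          # coarse visual search
order = multimodal_rerank(question_emb, query_img_emb, kb_text_embs[cand_idx])
best_entry = cand_idx[order[0]]                                 # top entry passed to the answer generator
```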
DIETCOKE, a novel method for zero-shot knowledge-based visual question answering (K-VQA), leverages the strengths of multiple question-answering strategies and rationale-based ensembles to achieve state-of-the-art performance on challenging K-VQA datasets.
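One generic way to combine answers from several question-answering strategies is simple voting over the candidate answers, as in the sketch below; DIETCOKE's actual rationale scoring and aggregation are more involved, so treat this only as an illustration of the ensembling idea.

```python
from collections import Counter

def ensemble_answers(candidates):
    """Majority vote over normalized candidate answers; ties fall back to
    the first strategy's answer (an illustrative tie-breaking rule)."""
    normalized = [a.lower().strip() for a in candidates]
    best, freq = Counter(normalized).most_common(1)[0]
    return best if freq > 1 else normalized[0]

# Hypothetical answers from three different prompting strategies
# (e.g., question-only, caption-augmented, rationale-augmented).
print(ensemble_answers(["frisbee", "a kite", "frisbee"]))  # -> "frisbee"
```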
Visual Question Answering (VQA) is a rapidly evolving field that combines elements of computer vision and natural language processing to generate answers to questions about visual inputs. This survey provides a comprehensive overview of the VQA domain, including its applications, problem definitions, datasets, methods, and emerging trends.
The RankVQA model combines a ranking-inspired hybrid training strategy with multimodal fusion techniques to substantially improve the performance of Visual Question Answering systems.
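A "ranking-inspired hybrid training strategy" can be loosely illustrated by pairing a standard answer-classification loss with a pairwise margin-ranking term that pushes the correct answer above a sampled distractor; the sketch below shows that generic combination, not RankVQA's exact objective.

```python
import torch
import torch.nn.functional as F

def hybrid_ranking_loss(logits, target, pos_score, neg_score, margin=0.2, alpha=0.5):
    """Generic hybrid objective: cross-entropy over the answer vocabulary plus a
    pairwise margin-ranking term favoring the correct answer over a distractor."""
    ce = F.cross_entropy(logits, target)
    rank = F.margin_ranking_loss(pos_score, neg_score,
                                 torch.ones_like(pos_score), margin=margin)
    return ce + alpha * rank

# Illustrative tensors: batch of 4, answer vocabulary of 3000.
logits = torch.randn(4, 3000)             # classifier scores over the answer vocabulary
target = torch.randint(0, 3000, (4,))     # ground-truth answer indices
pos_score = torch.randn(4)                # model score for the correct answer
neg_score = torch.randn(4)                # model score for a sampled distractor
loss = hybrid_ranking_loss(logits, target, pos_score, neg_score)
```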
Employing convolutional layers to extract multi-scale local textual features can improve performance on Visual Question Answering tasks compared to complex sequential models.
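A minimal sketch of this idea: parallel 1-D convolutions with different kernel widths slide over the question's word embeddings and act as n-gram detectors at several scales, with max-pooling over time producing a fixed-size question feature; all hyperparameters here are placeholder choices.

```python
import torch
import torch.nn as nn

class MultiScaleTextCNN(nn.Module):
    """Extract local n-gram features from a question with parallel 1-D
    convolutions of different kernel sizes, then max-pool over time."""
    def __init__(self, vocab_size=10_000, emb_dim=300, num_filters=128,
                 kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)        # (B, emb_dim, seq_len)
        feats = [torch.relu(conv(x)).max(dim=2).values   # (B, num_filters) per scale
                 for conv in self.convs]
        return torch.cat(feats, dim=1)                   # multi-scale question feature

# Illustrative usage: a batch of 8 questions, each padded to 20 tokens.
encoder = MultiScaleTextCNN()
question_feat = encoder(torch.randint(0, 10_000, (8, 20)))
print(question_feat.shape)  # torch.Size([8, 384]); fused with image features downstream
```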
This study explores innovative methods, including Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms, to improve the performance of Visual Question Answering (VQA) systems.