Key Concepts
Visual Question Answering (VQA) is a rapidly evolving field that combines elements of computer vision and natural language processing to generate answers to questions about visual inputs. This survey provides a comprehensive overview of the VQA domain, including its applications, problem definitions, datasets, methods, and emerging trends.
Summary
This survey presents a comprehensive analysis of the Visual Question Answering (VQA) domain. It begins by exploring the various applications of VQA, such as assisting the visually impaired, medical diagnosis, education, and visual chatbots. The survey then defines the scope and problem statement of VQA, highlighting its evolution from single-image question answering to generalized visual inputs.
Next, the survey reviews existing surveys in the VQA domain, categorizing them as generalized or specialized. Generalized surveys provide a broad overview of the field, while specialized surveys delve deeper into specific aspects, such as fusion techniques, language bias, video QA, and medical VQA.
The core of the survey focuses on VQA datasets, methods, and metrics. It traces the evolution of VQA datasets, from early traditional datasets to the more recent knowledge-based, reasoning, and bias-reduction datasets, and examines the progression of VQA methods, from early deep learning-based approaches to contemporary vision-language pre-training techniques.
Furthermore, the survey positions VQA within the broader context of multimodal learning, exploring related domains and sub-domains, such as image captioning, visual dialogue, and embodied QA. Finally, the survey highlights the current trends, open problems, and future directions in the VQA domain, emphasizing the potential for groundbreaking research.
Statistics
"Visual Question Answering (VQA) has been traditionally defined as the problem of answering a question with an image as the context [1]."
"The current scope of VQA is not limited to a single image as the visual input but can be generalized to any form of visual input e.g. set of images [2] or videos [3, 4]."
"The VQA methodologies have also undergone several phases but have permanently shifted to deep learning-based methods."
Quotes
"Visual Question Answering (VQA) is a rapidly evolving field that combines elements of computer vision and natural language processing to generate answers to questions about visual inputs."
"The survey also examines the progression of VQA methods, from early deep learning-based approaches to the contemporary vision-language pre-training techniques."
"The survey positions VQA within the broader context of multimodal learning, exploring related domains and sub-domains, such as image captioning, visual dialogue, and embodied QA."