This survey presents a comprehensive analysis of the Visual Question Answering (VQA) domain. It begins by exploring the various applications of VQA, such as assistance for the visually impaired, medical diagnosis, education, and visual chatbots. It then defines the scope and problem statement of VQA, highlighting its evolution from single-image question answering to generalized visual inputs.
Next, the survey reviews existing surveys in the VQA domain, categorizing them as generalized or specialized. Generalized surveys provide a broad overview of the field, while specialized surveys delve deeper into specific aspects, such as fusion techniques, language bias, video QA, and medical VQA.
The core of the survey focuses on VQA datasets, methods, and metrics. It discusses the evolution of VQA datasets, from early traditional datasets to the more recent knowledge-based, reasoning, and bias-reduction datasets. The survey also examines the progression of VQA methods, from early deep learning-based approaches to contemporary vision-language pre-training (VLP) techniques.
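On the metrics side, most of these datasets are evaluated with the soft accuracy introduced alongside the original VQA benchmark, where a predicted answer is scored against ten human-provided answers and receives full credit if at least three annotators agree. A minimal sketch of that formula follows; answer normalization and the official averaging over annotator subsets are omitted for brevity:

```python
# Minimal sketch of the soft accuracy metric from the VQA v1/v2 benchmarks:
# each question comes with 10 human answers, and a prediction scores
# min(#matching human answers / 3, 1). The official evaluator also
# normalizes answers and averages over annotator subsets (not shown here).
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Soft VQA accuracy: min(number of matching human answers / 3, 1)."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Example: 5 of 10 annotators said "red", so "red" gets full credit,
# while a rarer answer like "crimson" gets partial credit.
answers = ["red", "red", "red", "red", "maroon", "red brown",
           "dark red", "crimson", "red", "burgundy"]
print(vqa_accuracy("red", answers))      # 1.0   (5 matches >= 3)
print(vqa_accuracy("crimson", answers))  # 0.333 (1 match)
```

This soft scoring is what lets open-ended VQA tolerate annotator disagreement instead of requiring a single ground-truth string.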
Furthermore, the survey positions VQA within the broader context of multimodal learning, exploring related domains and sub-domains, such as image captioning, visual dialogue, and embodied QA. Finally, the survey highlights the current trends, open problems, and future directions in the VQA domain, emphasizing the potential for groundbreaking research.
Key insights distilled from: Md Farhan Is... on arxiv.org, 09-25-2024
https://arxiv.org/pdf/2311.00308.pdf