Core Concepts
Introducing the first fully authentic VQA dataset sourced from online communities and analyzing its unique features and challenges for modern VQA models.
Abstract
The paper introduces the VQAonline dataset, sourced from Stack Exchange, highlighting its authenticity and its unusually long answers. It compares VQAonline with existing datasets, emphasizing differences in context and answer length. The analysis covers dataset creation, a taxonomy of user intentions, model benchmarking, human evaluation, and the correlation of human judgments with automatic evaluation metrics.
Introduction:
Introduces the VQAonline dataset sourced from online question answering platforms.
Highlights the authenticity of the dataset and its unique features compared to existing datasets.
Related Work:
Discusses existing VQA datasets' limitations in authenticity and diversity of content.
Compares VQAonline with VizWiz-VQA, previously the only VQA dataset grounded in an authentic use case.
Dataset Creation:
Details the creation process of the VQAonline dataset from Stack Exchange data.
Explains filtering steps to obtain a final dataset of 64,696 visual questions.
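The filtering pipeline can be pictured as a sequence of keep/drop checks over raw posts. The sketch below is purely illustrative, assuming hypothetical record fields ("images", "accepted_answer") and criteria; the paper's actual filtering steps and field names may differ.

```python
# Illustrative sketch of filtering raw Q&A posts down to usable visual questions.
# Criteria and field names are hypothetical, not taken from the paper.

def filter_visual_questions(posts):
    """Keep posts with exactly one image and a community-accepted answer."""
    kept = []
    for post in posts:
        if len(post.get("images", [])) != 1:   # require a single attached image
            continue
        if not post.get("accepted_answer"):    # require an accepted answer
            continue
        kept.append(post)
    return kept

posts = [
    {"id": 1, "images": ["photo.jpg"], "accepted_answer": "Use a macro lens."},
    {"id": 2, "images": [], "accepted_answer": "N/A"},
    {"id": 3, "images": ["a.png", "b.png"], "accepted_answer": "See diagram."},
]
print([p["id"] for p in filter_visual_questions(posts)])  # → [1]
```

Each filter discards posts independently, so the order of checks only affects efficiency, not the final set.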
User Intentions Taxonomy:
Describes the taxonomy development process for user intentions in visual questions.
Presents final categories: advice, evidence, identification, instruction, opinion, reason, verification.
Model Benchmarking:
Evaluates six modern Vision and Language Models on the VQAonline dataset using popular evaluation metrics for long-form text.
Discusses model performance challenges and opportunities for improvement.
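Because answers in VQAonline average 173 words, benchmarking relies on long-form text metrics rather than exact-match VQA accuracy. As a simple stand-in for those metrics (not the paper's actual scoring code), the sketch below computes a token-level F1 between a model answer and a reference answer.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a generated answer and a reference answer.
    A simplified stand-in for long-form text metrics, for illustration only."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the lens is a macro lens", "use a macro lens"), 3))  # → 0.6
```

Note that overlap-based scores like this penalize valid paraphrases, one reason correlating metrics with human judgment (as the paper does) matters for long answers.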
Human Evaluation:
Conducts a human study to assess model performance qualitatively.
Correlates human judgments with quantitative evaluation metrics to evaluate alignment.
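Alignment between human judgments and automatic metrics is typically measured with a rank correlation such as Spearman's rho. The sketch below implements it from scratch (Pearson correlation of ranks, no tie correction) on hypothetical scores; the paper's exact correlation procedure is not specified here.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks.
    Simplified (no tie correction); for illustration only."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [5, 3, 4, 1, 2]              # hypothetical human quality ratings
metric = [0.9, 0.4, 0.7, 0.1, 0.3]   # hypothetical metric scores
print(round(spearman_rho(human, metric), 3))  # → 1.0
```

A rho near 1 means the metric ranks answers the same way humans do; a value near 0 means the metric is uninformative for this data.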
Supplementary Materials:
Provides additional details on dataset collection processes, user intention annotations, and task design.
Stats
Sourced from online question answering platforms; 64,696 visual questions collected from Stack Exchange data.
Answers tend to be much longer (mean of 173 words) compared to standard VQA datasets.
Six state-of-the-art vision-and-language models evaluated on VQAonline using popular metrics for long-form text.
Human annotators assigned a primary intent to each visual question from seven categories: advice, evidence, identification, instruction, opinion, reason, verification.
Correlation between human judgment and six evaluation metrics analyzed for 200 VQAs.
Quotes
"Observing that answers in our dataset tend to be much longer (i.e., a mean of 173 words)..."
"To facilitate future extensions... https://vqaonline.github.io/"
"Our findings reveal commonalities of our VQAonline dataset with existing datasets."