Core Concepts
TableVQA-Bench is a comprehensive benchmark designed to evaluate the table visual question answering capabilities of multi-modal large language models.
Summary
The authors introduce TableVQA-Bench, a new benchmark for evaluating table visual question answering (TableVQA) capabilities. The benchmark is constructed by leveraging existing table-related datasets, including table question answering (TableQA) and table structure recognition (TSR) datasets.
The key components of TableVQA-Bench are:
- Table images: Obtained by applying stylesheets to HTML or by using a proposed table rendering system (see the rendering sketch after this list).
- Text representations (HTML): Maintaining the content and style of the tables.
- Question-answer (QA) pairs: Generated using large language models (LLMs) for datasets without existing QA pairs.
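As a concrete illustration of the first component, the sketch below shows one way an HTML table plus a stylesheet could be rendered into an image with a headless browser (Playwright). The example table, CSS, and output path are illustrative assumptions; the paper's own rendering system is not reproduced here.

```python
# Minimal sketch: render an HTML table (with an inline stylesheet) to a PNG image.
# Requires Playwright: `pip install playwright && playwright install chromium`.
# The table content and CSS are placeholders, not the benchmark's actual styles.
from playwright.sync_api import sync_playwright

TABLE_HTML = """
<html><head><style>
  table { border-collapse: collapse; font-family: Arial, sans-serif; }
  th, td { border: 1px solid #333; padding: 4px 8px; }
</style></head><body>
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2022</td><td>$1.2B</td></tr>
  <tr><td>2023</td><td>$1.5B</td></tr>
</table>
</body></html>
"""

def render_table(html: str, out_path: str) -> None:
    """Render an HTML string and screenshot only the table element."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html)
        page.locator("table").screenshot(path=out_path)  # tightly cropped table image
        browser.close()

if __name__ == "__main__":
    render_table(TABLE_HTML, "table.png")
```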
The authors conduct comprehensive evaluations of various multi-modal large language models (MLLMs) on the TableVQA-Bench. They find that GPT-4V outperforms other commercial and open-sourced MLLMs across all table domains. The experiments also reveal that preserving the original visual features of tables is crucial for TableVQA performance. Additionally, the authors investigate the capabilities of MLLMs compared to their LLM backbones by presenting image-formatted and text-formatted tables, respectively. The results suggest that processing visual inputs is more challenging than text inputs for MLLMs.
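To make the image-versus-text comparison concrete, the following sketch poses the same question once with an image-formatted table and once with the HTML text of the table, using the OpenAI Chat Completions API. The model name, prompt wording, and file paths are assumptions for illustration and do not reproduce the paper's exact evaluation protocol.

```python
# Minimal sketch: pose the same table question as (a) an image-formatted table and
# (b) a text-formatted HTML table. Model name, prompts, and paths are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_image_table(image_path: str, question: str) -> str:
    """Send the rendered table image together with the question (MLLM-style input)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def ask_text_table(table_html: str, question: str) -> str:
    """Send the table as HTML text together with the question (LLM-style input)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Table (HTML):\n{table_html}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```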
Statistics
The TableVQA-Bench dataset contains a total of 894 images and 1,500 QA pairs.
The distribution analysis shows that questions in FinTabNetQA are generally longer than those in the other datasets, and that the answer-length distribution for VTabFact splits into two distinct categories corresponding to "true" and "false" responses.
The number of rows and the aspect ratio of tables are correlated: VWTQ contains many tables with a large number of rows, while FinTabNetQA tables often exhibit larger aspect ratios.
The number of vision tokens used by MLLMs varies widely, ranging from 32 to 1,445, and the efficiency of image-formatted tables decreases significantly once the length exceeds 1,000 tokens.
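The distribution figures above could, in principle, be recomputed from a local copy of the benchmark along the lines of the sketch below. The JSON field names ("question", "answer", "image_path") and file layout are assumptions about how the data might be stored, not the benchmark's documented format.

```python
# Minimal sketch: recompute question/answer length and table aspect-ratio statistics.
# Field names and file layout are assumed, not taken from the benchmark's release.
import json
from PIL import Image

def dataset_stats(qa_json_path: str) -> None:
    with open(qa_json_path) as f:
        samples = json.load(f)

    # Question and answer lengths in whitespace-separated tokens.
    q_lens = [len(s["question"].split()) for s in samples]
    a_lens = [len(s["answer"].split()) for s in samples]
    print("mean question length:", sum(q_lens) / len(q_lens))
    print("mean answer length:", sum(a_lens) / len(a_lens))

    # Table aspect ratios (width / height) from the rendered images.
    ratios = []
    for s in samples:
        with Image.open(s["image_path"]) as img:
            ratios.append(img.width / img.height)
    print("mean aspect ratio:", sum(ratios) / len(ratios))
```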
Quotes
"The primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either through the application of a stylesheet or by employing the proposed table rendering system."
"Through comparisons among MLLMs on TableVQA-Bench, we found that GPT-4V outperforms other methods including commercial and open-sourced models across all table domains."
"Our findings suggest that processing visual inputs is more challenging than text inputs, as evidenced by the lower performance of MLLMs, despite generally requiring higher computational costs than LLMs."