Core Concepts
The authors present ImplicitAVE, the first publicly available multimodal dataset for implicit attribute value extraction, and establish a comprehensive benchmark for evaluating multimodal large language models on this task.
Summary
The authors present ImplicitAVE, a new multimodal dataset for implicit attribute value extraction (AVE). Existing AVE datasets predominantly focus on explicit attribute values, often lack product images, are frequently not publicly available, and have not been human-inspected across diverse domains.
To address these limitations, the authors:
- Curated and expanded the MAVE dataset to create ImplicitAVE, a refined dataset of 68k training and 1.6k testing instances spanning five domains, 25 attributes, and 158 attribute values, with a focus on implicit AVE and multimodality (see the sketch after this list for what an instance and evaluation query might look like).
- Established a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on the ImplicitAVE dataset, covering six recent MLLMs with 11 variants. The results reveal that implicit value extraction remains a challenging task for open-source MLLMs.
- Conducted an in-depth analysis of the domain-level and attribute-level performance of the evaluated models, identifying key challenges and opportunities for future research.
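To make the benchmark setup concrete, here is a minimal sketch of how a single instance might be scored zero-shot with an MLLM. The instance schema, the field names, and the `query_mllm` helper are all assumptions for illustration; the actual ImplicitAVE format and model APIs may differ.

```python
# Hypothetical ImplicitAVE-style instance: the attribute value
# ("Short Sleeve") is implicit, i.e. inferable from the image or
# context but not stated verbatim in the product text.
instance = {
    "domain": "Clothing",
    "text": "Soft cotton tee with a relaxed fit, perfect for summer.",
    "image_path": "images/tee_001.jpg",
    "attribute": "Sleeve Style",
    "candidate_values": ["Short Sleeve", "Long Sleeve", "Sleeveless"],
    "gold_value": "Short Sleeve",
}

def build_prompt(inst: dict) -> str:
    """Frame implicit AVE as multiple-choice QA over candidate values."""
    options = ", ".join(inst["candidate_values"])
    return (
        f"Product description: {inst['text']}\n"
        f"What is this product's {inst['attribute']}? "
        f"Choose one of: {options}. Answer with the value only."
    )

def query_mllm(prompt: str, image_path: str) -> str:
    """Placeholder for a multimodal LLM call (e.g. GPT-4V or Qwen-VL);
    swap in a real client here."""
    return "Short Sleeve"  # stubbed so the sketch runs end to end

prediction = query_mllm(build_prompt(instance), instance["image_path"])
print(prediction.strip().lower() == instance["gold_value"].lower())  # True
```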
The authors found the Clothing domain, and attributes such as Sleeve Style and Neckline, to be the most challenging for the evaluated models. They also observed that while GPT-4V outperformed the other models, open-source MLLMs still lag behind in many domains and attributes, leaving clear opportunities for further research.
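The per-domain and per-attribute breakdown described above amounts to grouping instance-level correctness by metadata. A minimal sketch, assuming each prediction is recorded with its domain and attribute (the field names here are illustrative, not the paper's code):

```python
from collections import defaultdict

# Toy results: whether the model's extracted value matched the gold value.
results = [
    {"domain": "Clothing", "attribute": "Sleeve Style", "correct": False},
    {"domain": "Clothing", "attribute": "Neckline", "correct": False},
    {"domain": "Food", "attribute": "Flavor", "correct": True},
]

def accuracy_by(results: list, key: str) -> dict:
    """Group instance-level correctness by `key` and report accuracy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

print(accuracy_by(results, "domain"))     # {'Clothing': 0.0, 'Food': 1.0}
print(accuracy_by(results, "attribute"))
```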
Statistics
"Implicit values can only be inferred from the product image, contextual clues, or prior knowledge."
"Our dataset covers 5 diverse domains and 25 carefully curated attributes specially for the task of implicit attribute value extraction."
"We have a total of 158 diverse attribute values."
Quotes
"Existing datasets for attribute value extraction exhibit several key limitations: (1) They predominantly focus on explicit attribute values, neglecting implicit attribute values (Zheng et al., 2018; Wang et al., 2020), which are more challenging and commonly encountered in real-world scenarios; (2) Many datasets lack product images (Yan et al., 2021; Yang et al., 2022), limiting their applicability in multimodal contexts; (3) The limited number of publicly available datasets lack human inspection and cover only a few domains, resulting in inaccurate and restricted benchmarks (Xu et al., 2019; Zhang et al., 2023)."
"GPT-4V outperformed every other model in both the zero-shot and fine-tune setting in every single domain."
"Among the open-source MLLMs, no single model outperformed all other models across all the domains, but Qwen-VL had the best scores in the Jewelry&GA and Food domains."