Bongard-OpenWorld is a new benchmark for evaluating few-shot reasoning in machine vision on real-world visual concepts. It is based on the classical Bongard Problems but uses real-world images and open-world, free-form concepts. Given a set of positive images and a set of negative images, the task is to identify the visual concept depicted exclusively by the positive images and then make a binary prediction on each query image. The dataset's visual concepts are extracted from Conceptual Captions and supplemented with crowd-sourced challenging concepts. Each problem consists of a positive set and a negative set, with distractors included to increase difficulty. The dataset statistics show a wide range of concept lengths and a long-tailed distribution of words. Several approaches have been evaluated, including ones built on Large Language Models (LLMs) and Vision-Language Models (VLMs). SNAIL shows promising results, but no approach has closed the human-machine gap.
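The task format described above can be illustrated with a small Python sketch. This is a toy illustration, not the authors' method: it assumes each image has already been reduced to a set of hypothetical text tags, whereas the benchmark itself works on raw images with free-form concepts. The names `BongardProblem`, `induce_concept`, and `predict` are invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BongardProblem:
    # Assumption: every image is represented by a set of tags.
    positives: list[frozenset[str]]
    negatives: list[frozenset[str]]


def induce_concept(problem: BongardProblem) -> frozenset[str] | None:
    """Return the tags shared by all positives, if no negative shows them all."""
    shared = frozenset.intersection(*problem.positives)
    if not shared:
        return None  # the positives have nothing in common
    # A negative may share part of the concept (a distractor), but a
    # negative containing the whole concept means it is not exclusive.
    if any(shared <= negative for negative in problem.negatives):
        return None
    return shared


def predict(problem: BongardProblem, query: frozenset[str]) -> bool:
    """Binary prediction: does the query image depict the induced concept?"""
    concept = induce_concept(problem)
    return concept is not None and concept <= query


problem = BongardProblem(
    positives=[
        frozenset({"dog", "beach", "ball"}),
        frozenset({"dog", "beach", "waves"}),
    ],
    negatives=[
        frozenset({"dog", "grass"}),   # distractor: shares "dog" only
        frozenset({"cat", "beach"}),   # distractor: shares "beach" only
    ],
)
```

With this toy problem, `predict(problem, frozenset({"dog", "beach", "sunset"}))` is `True` and `predict(problem, frozenset({"dog", "park"}))` is `False`. The negatives each overlap the concept in one tag, which is the role the summary assigns to distractors: a solver that latches onto a single shared attribute would misclassify them.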
Key Insights Distilled From
by Rujie Wu, Xia... at arxiv.org 03-05-2024
https://arxiv.org/pdf/2310.10207.pdf