The FunQA dataset poses challenging tasks that evaluate how well models understand surprising videos, focusing on humor, creativity, and magic. It emphasizes counter-intuitive reasoning and deep video comprehension. The dataset consists of three subsets, HumorQA, CreativeQA, and MagicQA, each with tasks designed to assess model performance in timestamp localization, detailed description, and reasoning about counter-intuitiveness. The work also introduces FunMentor, an agent that refines Vision-Language Models (VLMs) through multi-turn dialogues to improve their understanding of surprising content.
Key Insights Distilled From
by Binzhu Xie,S... : arxiv.org 03-25-2024
https://arxiv.org/pdf/2306.14899.pdf
Deeper Questions