The paper proposes EAGLE, a novel Multimodal Large Language Model (MLLM) that enables efficient comprehension of arbitrary referring visual prompts. Existing approaches utilize specialized feature encoding modules to capture the semantics of highlighted areas indicated by referring visual prompts, and then adapt these encoded region features to MLLMs through fine-tuning on curated multimodal instruction datasets. However, this design suffers from redundancy as it overlooks the innate region-level comprehension capabilities of MLLMs. Moreover, these methods face challenges in effectively generalizing when encountering diverse arbitrary referring visual prompts in real-life scenarios, primarily due to their sensitivity to the quality of the provided referring visual prompts.
To address these issues, the authors propose two key innovations in EAGLE:
Rendering diverse formats of referring visual prompts as colored patches directly onto the image, which then serves as the image input for instruction tuning. This design respects the innate region-level comprehension capabilities of MLLMs and requires less training effort than previous approaches.
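For concreteness, the rendering step can be pictured as alpha-blending a semi-transparent colored overlay over the region a prompt covers. The sketch below is a minimal illustration of that idea in Python; the `render_prompt_as_patch` helper, the red overlay color, and the 0.5 alpha value are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: render a referring visual prompt as a colored patch.
# The helper name, overlay color, and alpha are assumptions for illustration.
import numpy as np
from PIL import Image


def render_prompt_as_patch(image: Image.Image,
                           region_mask: np.ndarray,
                           color=(255, 0, 0),
                           alpha: float = 0.5) -> Image.Image:
    """Alpha-blend a colored patch over the region marked by `region_mask`.

    `region_mask` is a boolean HxW array; True marks pixels covered by the
    referring visual prompt (a box, scribble, or mask rasterized beforehand).
    """
    img = np.asarray(image.convert("RGB")).astype(np.float32)
    overlay = np.zeros_like(img)
    overlay[...] = color
    blended = img.copy()
    blended[region_mask] = (1 - alpha) * img[region_mask] + alpha * overlay[region_mask]
    return Image.fromarray(blended.astype(np.uint8))


# Example: a box prompt rasterized into a mask, then rendered onto the image.
image = Image.new("RGB", (224, 224), (128, 128, 128))  # placeholder image
mask = np.zeros((224, 224), dtype=bool)
mask[50:150, 60:180] = True                            # box region
prompted_image = render_prompt_as_patch(image, mask)
```

Because the prompt is baked into the pixels, the MLLM sees an ordinary image and no extra region-encoding module is needed.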
Introducing a Geometry-Agnostic Learning (GAL) paradigm to disentangle the region-level recognition from the specific formats of referring visual prompts. GAL reformulates diverse referring visual prompts into a set of representative points, which alleviates the influence of shapes and formats on the MLLM's region-level comprehension.
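The point-reformulation step of GAL can be pictured as collapsing any prompt geometry into a small, fixed-size set of coordinates. The sketch below illustrates one way to do this; the `prompt_to_points` helper, uniform random sampling, and the choice of 8 points are assumptions for illustration, not the paper's actual procedure.

```python
# Minimal sketch of the point-reformulation idea behind Geometry-Agnostic
# Learning: whatever the prompt's geometry (box, scribble, mask), reduce it
# to a small set of representative points. Sampling strategy and k are
# illustrative assumptions.
import numpy as np


def prompt_to_points(region_mask: np.ndarray, k: int = 8,
                     seed: int = 0) -> np.ndarray:
    """Return up to k (x, y) points sampled from the prompt's foreground pixels."""
    ys, xs = np.nonzero(region_mask)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xs), size=min(k, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)


# The same interface covers a box, an irregular mask, or a scribble,
# so downstream region-level reasoning never sees the original shape.
box_mask = np.zeros((224, 224), dtype=bool)
box_mask[50:150, 60:180] = True
points = prompt_to_points(box_mask)  # shape (8, 2), columns are (x, y)
```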
Extensive experiments on semantic segmentation and arbitrary box recognition tasks demonstrate the effectiveness of EAGLE in handling diverse referring visual prompts, outperforming state-of-the-art methods. The authors also propose a novel benchmark to evaluate the performance of MLLMs against incomplete, irregularly-shaped masks, further validating the advantages of EAGLE.