The paper examines how vulnerable Large Multimodal Models (LMMs) are to typographic attacks and proposes informative prompts as a mitigation. It introduces a Typographic Dataset for evaluating distractibility across a range of tasks and shows how typographic factors affect model performance.
The study finds that even imperceptible typos can mislead models, which points to the need for richer information in the prompt. By analyzing the role of vision encoders and running experiments on state-of-the-art LMMs, it offers insight into how typographic vulnerabilities in multimodal models can be addressed.
Key insights distilled from:
Hao Cheng, Er... at arxiv.org, 03-01-2024
https://arxiv.org/pdf/2402.19150.pdf