The paper explores how multi-modal large language models (MLLMs) can be leveraged to assist visually impaired (VI) individuals. It first analyzes the challenges current MLLMs face when handling queries from VI users, such as poor image quality, a target that is only partially captured, or a target that is missing from the frame entirely.
To address these challenges, the paper introduces VIAssist, an MLLM designed specifically for VI users. VIAssist is fine-tuned on an instruction dataset collected for this purpose, which pairs various types of low-quality images with corresponding questions and responses.
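The exact schema of this dataset is not given in this summary; as a rough illustration only, a single instruction-tuning record might resemble the sketch below, where every field name and value is hypothetical.

```python
# Hypothetical sketch of one instruction-tuning record; the actual schema
# used by VIAssist is not specified here, so all fields are assumptions.
example_record = {
    "image": "images/blurred_sign.jpg",                 # a deliberately low-quality photo
    "quality_issue": "target partially out of frame",   # one of the failure types the paper lists
    "question": "What does this street sign say?",
    "response": (
        "The sign is cut off on the right side of the photo. "
        "Please move the camera slightly to the right, hold it steady, "
        "and take the photo again."
    ),
}
```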
For low-quality images, VIAssist provides actionable, detailed suggestions for retaking the photo, such as adjusting the shooting angle, distance, and framing. Once a high-quality photo has been captured, VIAssist produces reliable answers to VI users' queries.
Both qualitative and quantitative results show that VIAssist outperforms existing MLLMs in response quality, as measured by BERTScore and ROUGE. Its responses exhibit higher semantic similarity to reference answers and align more closely with the user's intent, while also being more concise and practical for VI users.
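For readers unfamiliar with these metrics, the sketch below shows one common way to compute them with the open-source `bert-score` and `rouge-score` packages; the candidate and reference strings are invented for illustration and are not taken from the paper's evaluation set.

```python
# Illustrative metric computation; the strings below are made up and do not
# come from the VIAssist evaluation data.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

reference = "Move the camera left so the whole sign is visible, then retake the photo."
candidate = "Please shift the camera to the left to capture the entire sign and take another photo."

# BERTScore: semantic similarity based on contextual embeddings (returns P, R, F1 tensors).
precision, recall, f1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {f1.item():.3f}")

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```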
The paper also discusses future research directions, including expanding the instruction dataset, enabling automatic camera adjustments, improving real-time performance, and exploring the use of additional modalities beyond images to further enhance the assistance provided to VI individuals.