Core Concepts
VIAssist, an MLLM tailored for visually impaired users, can provide actionable suggestions for capturing high-quality images and generate reliable answers to queries based on those images.
Abstract
The paper explores how to leverage multi-modal large language models (MLLMs) to assist visually impaired (VI) individuals. It first analyzes the challenges that current MLLMs face in handling queries from VI users: poor image quality, a target that is only partially in the frame, and a target that falls entirely outside the image.
To address these challenges, the paper introduces VIAssist, an MLLM designed specifically for VI users. VIAssist is fine-tuned on an instruction dataset collected for this purpose, which includes various types of low-quality images and corresponding questions and responses.
For low-quality images, VIAssist can provide actionable and detailed suggestions for retaking photos, such as adjusting the shooting angle, distance, and framing. Upon capturing a high-quality photo, VIAssist is capable of producing reliable answers to queries from VI users.
Qualitative and quantitative results both show that VIAssist outperforms existing MLLMs in response quality, with the quantitative comparison measured by BERTScore and ROUGE. VIAssist's responses are semantically closer to the reference responses and align better with the user's original intent, while also being more concise and practical for VI users.
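To make the ROUGE side of that evaluation concrete, here is a minimal sketch of a ROUGE-L F1 score, which rewards the longest common subsequence of words shared by a model response and a reference response. This summary does not say which ROUGE variant or tokenizer the paper uses, so the whitespace tokenization, the function names `lcs_length` and `rouge_l`, and the example sentences are all assumptions made for illustration. BERTScore depends on a pretrained language model and is not sketched here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 between two strings (assumed: lowercase, whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical response/reference pair, not taken from the paper's dataset.
score = rouge_l("move the camera left", "move the camera slightly left")
```

In this made-up pair the response misses one reference word, so recall drops below precision and the F1 score falls below 1. Established ROUGE packages apply their own tokenization and stemming rules and may give different values.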
The paper also discusses future research directions, including expanding the instruction dataset, enabling automatic camera adjustments, improving real-time performance, and exploring the use of additional modalities beyond images to further enhance the assistance provided to VI individuals.
Stats
The image you've uploaded shows a yellow paper sign on the wall with some text on it, but only the portion "n Door" is visible.
The image you uploaded shows a sign with Chinese characters on it. The visible part of the sign says "保健中心U11", which translates to "Health Center U11".
Quotes
"VIAssist can provide actionable and detailed advice on retaking photos, such as adjusting the shooting angle, distance, and framing."
"Upon capturing a high-quality photo, VIAssist is capable of producing reliable answers to queries from VI users."