The proposed system uses Aria smart glasses with embedded RGB cameras to capture egocentric video of the user's surroundings. The video is processed with state-of-the-art object detection and optical character recognition (OCR) techniques to extract textual information, such as the contents of a restaurant menu. The extracted text is then fed into a Large Language Model (LLM), specifically GPT-4, to create a structured digital representation of the menu.
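A minimal sketch of such a capture-to-OCR-to-LLM pipeline, under assumptions not stated in the original, might look as follows; the use of pytesseract for OCR, the OpenAI client, and the `extract_menu_text`/`structure_menu` helpers are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of the capture -> OCR -> LLM structuring pipeline.
# pytesseract, Pillow, and the OpenAI client stand in for whatever OCR and
# LLM back end the actual system uses; all names here are illustrative.
from PIL import Image
import pytesseract
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_menu_text(frame_path: str) -> str:
    """Run OCR on a single egocentric video frame saved as an image."""
    frame = Image.open(frame_path)
    return pytesseract.image_to_string(frame)


def structure_menu(raw_text: str) -> str:
    """Ask the LLM to turn noisy OCR output into a structured menu."""
    prompt = (
        "The following text was extracted from a restaurant menu by OCR. "
        "Correct obvious OCR errors and return the menu as a list of items "
        "with names, descriptions, and prices:\n\n" + raw_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    text = extract_menu_text("menu_frame.jpg")  # placeholder frame path
    print(structure_menu(text))
```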
The system also incorporates the user's personal preferences, such as dietary restrictions and food likes and dislikes, which are retrieved from sources such as bank transactions, Google Photos, and Google Maps. The LLM-based chatbot then uses this personalized information to provide contextual, tailored recommendations, such as suggesting suitable menu items.
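One plausible way to fold these preferences into the recommendation step is to inject a short user profile alongside the structured menu in the LLM prompt; the sketch below assumes a simple preference dictionary and a hypothetical `recommend_items` helper, and is not the authors' actual prompting scheme.

```python
# Hypothetical sketch of preference-aware recommendation prompting.
# The preference fields and recommend_items helper are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def recommend_items(structured_menu: str, preferences: dict) -> str:
    """Ask the LLM for menu suggestions tailored to the user's preferences."""
    profile = (
        f"Dietary restrictions: {', '.join(preferences.get('restrictions', [])) or 'none'}. "
        f"Likes: {', '.join(preferences.get('likes', [])) or 'unspecified'}. "
        f"Dislikes: {', '.join(preferences.get('dislikes', [])) or 'unspecified'}."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You help a visually impaired user choose dishes. " + profile,
            },
            {
                "role": "user",
                "content": "Here is the menu:\n" + structured_menu
                + "\n\nWhich items would suit me, and why?",
            },
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    prefs = {
        "restrictions": ["vegetarian"],
        "likes": ["spicy food"],
        "dislikes": ["mushrooms"],
    }
    print(recommend_items("1. Paneer Tikka ... 2. Mushroom Risotto ...", prefs))
```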
The system was evaluated in a real-world setting with four participants, each with a different native language, who interacted with it while reading menus from various restaurants. The results showed a text-retrieval accuracy of 96.77%, and all participants rated the system's performance and recommendations as highly satisfactory, with an average rating of 4.87 out of 5.
The proposed framework highlights the potential of integrating egocentric vision, LLMs, and personalized data to create an effective and accessible reading assistance solution for visually impaired individuals, addressing challenges in daily activities and improving their independence and quality of life.