Basic Concepts
GazePointAR is a context-aware multimodal voice assistant for wearable augmented reality that leverages eye gaze, pointing gestures, and conversation history to disambiguate speech queries containing pronouns.
Summary
The paper introduces GazePointAR, a fully functional context-aware voice assistant (VA) for wearable augmented reality (AR) that uses eye gaze, pointing gestures, and conversation history to disambiguate speech queries containing pronouns.
Key highlights:
- GazePointAR analyzes the user's field-of-view using computer vision to extract objects, text, and faces, and replaces pronouns in the user's query with a coherent phrase describing the referent (a minimal sketch of this pronoun-replacement step appears after this list).
- The authors conducted a three-part lab study with 12 participants to evaluate GazePointAR. In Part 1, they compared GazePointAR to two commercial query systems (Google Voice Assistant and Google Lens). In Part 2, they examined GazePointAR's performance across various context-dependent tasks. In Part 3, participants brainstormed and tried their own context-sensitive queries.
- Participants appreciated GazePointAR's simplicity, naturalness, and human-like interaction, often preferring to use pronouns over full descriptions. However, they also noted limitations, such as gaze being captured at only a single moment, the inability to handle queries with multiple pronouns, and object recognition errors.
- Informed by the lab study, the authors created an improved GazePointAR prototype and conducted a first-person diary study, where the first author used GazePointAR in daily life for 5 days.
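
The paper summarized above does not include implementation details here, so the following is only a minimal illustrative sketch, in Python, of the pronoun-replacement idea: detected objects from the field-of-view are ranked by pointing and gaze cues, and the first pronoun in the spoken query is swapped for a descriptive phrase. The detection records, the pointing-over-gaze priority, and the `resolve_referent` and `rewrite_query` helpers are hypothetical names invented for this example, not GazePointAR's actual API.

```python
import re

# Hypothetical detection records: each is an object label recognized in the
# user's field-of-view, plus whether the gaze ray or pointing ray hits it.
DETECTIONS = [
    {"label": "a jar of peanut butter", "gazed_at": True,  "pointed_at": False},
    {"label": "a coffee mug",           "gazed_at": False, "pointed_at": False},
]

# Pronouns that commonly need disambiguation in context-dependent queries.
PRONOUNS = re.compile(r"\b(it|this|that)\b", flags=re.IGNORECASE)


def resolve_referent(detections):
    """Pick the most likely referent; assume pointing outranks gaze alone."""
    pointed = [d for d in detections if d["pointed_at"]]
    if pointed:
        return pointed[0]["label"]
    gazed = [d for d in detections if d["gazed_at"]]
    if gazed:
        return gazed[0]["label"]
    return None


def rewrite_query(query, detections):
    """Replace the first pronoun in the query with a phrase for the referent."""
    referent = resolve_referent(detections)
    if referent is None:
        return query  # no referent found; pass the raw query through
    return PRONOUNS.sub(referent, query, count=1)


print(rewrite_query("How many calories are in it?", DETECTIONS))
# -> "How many calories are in a jar of peanut butter?"
```

The rewritten, pronoun-free query could then be sent to an ordinary voice-assistant backend, which is the general shape of the pipeline the paper describes; the real system additionally uses conversation history and handles text and faces, which this sketch omits.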
Statistics
"72% of respondents indicated that they use voice assistants for tasks such as playing music, setting timers, controlling IoT devices, and managing shopping lists."
"Over one-third of 1,068 dialogue turns contained referential occurrences of pronouns "it" and "that"."
Quotes
"When speaking to GazePointAR, I am giving it a voice input while also interacting with the product that I am talking about. Perceptually, this is the most natural way of speaking, which is why we do this when talking to other people as well."
"If you're pointing at something, you have to use your hand. This implies that you still have use of your hands during some tasks. Also, because the jar is so close, the system shouldn't need pointing to tell what I'm talking about."
"although I use voice assistants almost every day to play music or something, I now realize that many things I look at are difficult to clearly describe in text... since with this people can now input their environment easily, I think it will make speaking to voice assistants easier in many everyday activities."