Core Concepts
A two-stage gaze estimation framework that combines event and frame data, using anchor states and denoising distillation to achieve highly accurate gaze tracking.
Abstract
The paper presents a novel gaze estimation framework that leverages both event and frame data to achieve superior performance. The key highlights are:
Formulation of gaze estimation as modeling the state transition from a baseline anchor state to the current state, capturing the complex dynamics of head-eye coordination (see the first sketch after this list).
A two-stage architecture that first selects the most representative anchor state using an MLP, then uses transformers to model the correlation between the anchor and current states for accurate gaze prediction, as sketched below.
A denoising distillation method that amalgamates the expertise of multiple pre-trained local expert networks into a single, more robust student network, mitigating the adverse effects of noise in the event data (see the second sketch after this list).
Extensive experiments demonstrating the effectiveness of the proposed approach, which outperforms state-of-the-art gaze estimation methods by a 15% margin in accuracy.
The authors also conduct detailed ablation studies on components such as the number of anchor states, the number of gradient accumulation steps (sketched in the final example below), and the weight of the feature-map loss; the results show that these design choices are central to the reported performance.
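Concretely, the highlights above describe predicting gaze as an anchor gaze plus a learned shift, roughly g_t ≈ g_a + Δg(a, t). Below is a minimal, hypothetical PyTorch sketch of that two-stage design; every module name, dimension, and the soft (softmax-weighted) anchor selection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoStageGazeNet(nn.Module):
    """Hypothetical sketch: MLP anchor selection + transformer correlation."""
    def __init__(self, feat_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        # Stage 1: an MLP scores each candidate anchor state.
        self.anchor_scorer = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )
        # Stage 2: a transformer models the correlation between the
        # selected anchor state and the current state.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Head regresses the gaze shift (e.g., yaw, pitch) from anchor to current.
        self.shift_head = nn.Linear(feat_dim, 2)

    def forward(self, candidate_feats, current_feat, candidate_gazes):
        # candidate_feats: (B, K, D) features of K candidate anchor states
        # current_feat:    (B, D)    feature of the current state
        # candidate_gazes: (B, K, 2) gaze angles tied to each candidate
        scores = self.anchor_scorer(candidate_feats).squeeze(-1)        # (B, K)
        weights = scores.softmax(dim=-1)                                # soft selection
        anchor_feat = (weights.unsqueeze(-1) * candidate_feats).sum(1)  # (B, D)
        anchor_gaze = (weights.unsqueeze(-1) * candidate_gazes).sum(1)  # (B, 2)
        # Encode anchor and current states jointly as a two-token sequence.
        pair = torch.stack([anchor_feat, current_feat], dim=1)          # (B, 2, D)
        encoded = self.encoder(pair)
        shift = self.shift_head(encoded[:, 1])                          # (B, 2)
        # Gaze is modeled as the anchor gaze plus the predicted shift.
        return anchor_gaze + shift
```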
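The denoising distillation idea can be sketched in the same spirit. Here the "denoising" is approximated by simply averaging the frozen experts' outputs and features, which is an assumption for illustration only; the paper describes a self-supervised latent denoising method. `lam_feat` stands in for the feature-map loss weight mentioned in the ablations.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, student_feat, experts_out, experts_feat,
                      target_gaze, lam_feat=0.1):
    # experts_out:  list of (B, 2) gaze predictions from frozen expert networks
    # experts_feat: list of (B, D) intermediate features from the same experts
    # Averaging across experts suppresses per-expert noise (a stand-in for
    # the paper's latent denoising step).
    denoised_out = torch.stack(experts_out).mean(0)
    denoised_feat = torch.stack(experts_feat).mean(0)
    loss_task = F.l1_loss(student_out, target_gaze)      # ground-truth term
    loss_kd = F.mse_loss(student_out, denoised_out)      # output distillation
    loss_feat = F.mse_loss(student_feat, denoised_feat)  # feature-map term
    return loss_task + loss_kd + lam_feat * loss_feat
```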
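One of the ablated hyperparameters, the number of gradient accumulation steps, refers to a standard training technique: gradients from several mini-batches are summed before each optimizer update, emulating a larger effective batch size. A self-contained toy sketch follows; the model, optimizer, and data are stand-ins, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins so the snippet runs end to end.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [(torch.randn(8, 16), torch.randn(8, 2)) for _ in range(8)]

accum_steps = 4  # the ablated hyperparameter
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale each mini-batch loss so the accumulated gradient matches
    # a single update over an accum_steps-times-larger batch.
    loss = F.l1_loss(model(inputs), targets) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```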
Stats
Beyond the reported 15% accuracy improvement over state-of-the-art methods, the paper does not provide further specific numerical data or metrics; it focuses on the overall performance gains.
Quotes
"We formulate the gaze estimation as a end-to-end prediction of state shifting from selected anchor state"
"We distill multiple pre-trained local expert networks into a more robust student network to combat overfitting in gaze estimation"
"We propose a self-supervised latent denoising method to mitigate the adverse effects of noise from expert networks to improve the performance of the student network"