
Practical Task-Driven Drivers' Gaze Prediction Using Map and Route Information

Core Concepts
Accurate prediction of drivers' gaze is crucial for vision-based driver monitoring and assistive systems, especially during safety-critical episodes such as performing maneuvers or crossing intersections. The proposed SCOUT+ model leverages map and route information inferred from commonly available GPS data to effectively model task and context influences on drivers' attention.
The paper introduces SCOUT+, a task- and context-aware model for drivers' gaze prediction that utilizes map and route information inferred from GPS data. This is an extension of the previous SCOUT model, which relied on hand-coded text labels for representing task and context. The key highlights are:
Map and route information is extracted from GPS data using the OpenStreetMap API and map matching techniques. This information is then rasterized and used as input to the model.
The SCOUT+ architecture consists of a scene encoder, a map encoder, and a scene-map transformer that fuses the visual and map/route features using cross-attention.
Experiments on the DR(eye)VE and BDD-A datasets show that SCOUT+ achieves performance comparable to the previous SCOUT model, which used privileged ground truth information for task and context. Significant improvements are observed on challenging scenarios involving lateral actions and intersections.
Adding map information alone, without any visual input, also results in competitive performance, highlighting the usefulness of the map representation.
Fusing map features with later encoder blocks of the scene encoder leads to better results compared to earlier blocks, suggesting that high-level spatio-temporal information is more beneficial for gaze prediction.
Overall, the proposed SCOUT+ demonstrates that leveraging commonly available GPS data can effectively model task and context influences, making the system more practical for real-world deployment.
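The two mechanisms named above, rasterizing a route and fusing scene features with map features through cross-attention, can be illustrated with a toy sketch. This is not the authors' implementation: the grid size, the token layout, the hand-made scene tokens, and the function names `rasterize_route`, `grid_to_tokens`, and `cross_attention` are all assumptions made for illustration, and the attention shown has a single head with no learned projections.

```python
import math

def rasterize_route(points, size):
    """Draw a polyline of (row, col) grid points into a size x size binary grid."""
    grid = [[0.0] * size for _ in range(size)]
    for (r0, c0), (r1, c1) in zip(points, points[1:]):
        # Sample once per cell along the longer axis of the segment.
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for s in range(steps + 1):
            r = round(r0 + (r1 - r0) * s / steps)
            c = round(c0 + (c1 - c0) * s / steps)
            if 0 <= r < size and 0 <= c < size:
                grid[r][c] = 1.0
    return grid

def grid_to_tokens(grid):
    """Flatten the grid into tokens [occupancy, normalized row, normalized col]."""
    size = len(grid)
    scale = max(size - 1, 1)
    return [[grid[r][c], r / scale, c / scale]
            for r in range(size) for c in range(size)]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends to all keys."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Hypothetical route: straight ahead along the top row, then a turn down the last column.
route_grid = rasterize_route([(0, 0), (0, 3), (3, 3)], size=4)
map_tokens = grid_to_tokens(route_grid)

# Hypothetical scene features (stand-ins for scene-encoder outputs).
scene_tokens = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]

# Scene tokens act as queries; map tokens act as keys and values.
fused = cross_attention(scene_tokens, map_tokens, map_tokens)
```

In a trained model the queries, keys, and values would pass through learned linear projections and the map tokens would come from a map encoder rather than raw grid cells; the sketch only shows the direction of information flow from map to scene.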
The model was evaluated on the following key metrics:
Kullback-Leibler divergence (KLD): Lower is better
Pearson's correlation coefficient (CC): Higher is better
Normalized scanpath saliency (NSS): Higher is better
Histogram similarity (SIM): Higher is better
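For readers unfamiliar with these metrics, the sketch below computes them on flattened saliency maps using their common definitions from the saliency literature. The stability constant `EPS`, the normalization choices, and the function names are assumptions; the paper's evaluation code may differ in such details.

```python
import math

EPS = 1e-12  # assumed small constant to avoid division by zero and log(0)

def normalize(m):
    """Scale a non-negative map so its values sum to 1."""
    total = sum(m)
    return [x / total for x in m]

def kld(pred, gt):
    """KL divergence of the prediction from the ground truth (lower is better)."""
    p, g = normalize(pred), normalize(gt)
    return sum(gi * math.log(EPS + gi / (pi + EPS)) for pi, gi in zip(p, g))

def cc(pred, gt):
    """Pearson's correlation coefficient between two maps (higher is better)."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gt) / n
    cov = sum((a - mp) * (b - mg) for a, b in zip(pred, gt))
    sp = math.sqrt(sum((a - mp) ** 2 for a in pred))
    sg = math.sqrt(sum((b - mg) ** 2 for b in gt))
    return cov / (sp * sg)

def nss(pred, fixations):
    """Mean z-scored prediction at fixated pixels (higher is better)."""
    n = len(pred)
    mean = sum(pred) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in pred) / n)
    hits = [(a - mean) / std for a, f in zip(pred, fixations) if f]
    return sum(hits) / len(hits)

def sim(pred, gt):
    """Histogram intersection of the two normalized maps (higher is better)."""
    p, g = normalize(pred), normalize(gt)
    return sum(min(a, b) for a, b in zip(p, g))

# Hypothetical 4-pixel maps: a prediction, a reversed ground truth, and one fixation.
pred = [0.1, 0.2, 0.3, 0.4]
gt = [0.4, 0.3, 0.2, 0.1]
fixations = [0, 0, 0, 1]

scores = {
    "KLD": kld(pred, gt),
    "CC": cc(pred, gt),
    "NSS": nss(pred, fixations),
    "SIM": sim(pred, gt),
}
```

KLD, CC, and SIM compare the prediction against a continuous ground-truth saliency map, while NSS is evaluated only at discrete fixation locations, which is why it takes a binary fixation map instead.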
"Accurate prediction of drivers' gaze is an important component of vision-based driver monitoring and assistive systems." "Explicit modeling of top-down factors affecting drivers' attention often requires additional information and annotations that may not be readily available." "We address the challenge of effective modeling of task and context with common sources of data for use in practical systems."

Key Insights Distilled From

by Iuliia Kotse... at 04-16-2024
SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction

Deeper Inquiries

How can the map representation be further improved to capture more detailed information about the road layout, lane markings, and presence of other road users?

To enhance the map representation for a more detailed understanding of the road environment, several improvements can be implemented:
Lane-Level Information: Incorporating lane-level data into the map representation can provide crucial details about lane markings, lane changes, and lane occupancy. This can be achieved by integrating lane detection algorithms or utilizing high-definition maps that include lane-level information.
Road Object Detection: Implementing object detection algorithms to identify and classify various road users such as vehicles, pedestrians, cyclists, and obstacles can enrich the map representation. This information can help in predicting driver gaze more accurately by considering the presence and movements of other road users.
Traffic Sign Recognition: Integrating traffic sign recognition systems can help in identifying traffic signs, signals, and road markings. This information can contribute to a more comprehensive map representation by including regulatory elements that influence driver behavior and attention.
Road Geometry: Enhancing the map with detailed road geometry data, including curves, intersections, roundabouts, and road layouts, can provide a more holistic view of the driving environment. This can aid in predicting driver gaze during complex maneuvers and challenging driving scenarios.

What other sensor data, in addition to GPS, could be leveraged to better model the driver's task and context?

In addition to GPS data, leveraging the following sensor data can enhance the modeling of the driver's task and context:
Camera Data: Integrating data from onboard cameras can offer visual cues about the driver's surroundings, including traffic conditions, road signs, and the behavior of other road users. This visual information can complement GPS data and provide a more comprehensive understanding of the driving context.
LiDAR and Radar: Utilizing LiDAR and radar sensors can enable the detection and tracking of objects around the vehicle, such as vehicles, pedestrians, and cyclists. This data can enhance the awareness of the driver's surroundings and assist in predicting gaze behavior based on the proximity and movements of surrounding objects.
Vehicle Telemetry: Incorporating vehicle telemetry data, such as speed, acceleration, steering angle, and braking status, can provide valuable insights into the driver's actions and intentions. By analyzing these parameters in conjunction with GPS data, a more accurate representation of the driver's task can be achieved.
Communication Systems: Integrating vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication systems can provide real-time information about traffic conditions, road hazards, and cooperative driving scenarios. This data can enrich the driver's context representation and improve the prediction of gaze behavior in dynamic driving environments.

How can the proposed approach be extended to handle more complex driving scenarios, such as urban environments with higher traffic density and more diverse maneuvers?

To extend the proposed approach for handling more complex driving scenarios in urban environments with higher traffic density and diverse maneuvers, the following strategies can be implemented:
Semantic Segmentation: Incorporating semantic segmentation algorithms to classify road scenes into different categories such as roads, sidewalks, vehicles, pedestrians, and buildings can provide a detailed understanding of the urban environment. This segmentation can help in capturing complex scenarios and predicting driver gaze based on specific scene elements.
Dynamic Object Tracking: Implementing real-time object tracking algorithms to monitor the movements of vehicles, pedestrians, and cyclists in urban settings can enhance the model's ability to predict driver gaze in response to dynamic and unpredictable events. By tracking objects' trajectories, the model can anticipate potential areas of interest for the driver.
Multi-Modal Fusion: Integrating data from multiple sensors, including cameras, LiDAR, radar, and GPS, through multi-modal fusion techniques can provide a comprehensive view of the urban driving environment. By combining information from different modalities, the model can capture the complexity of urban scenarios and adapt to diverse driving conditions.
Behavior Prediction: Incorporating behavior prediction models for other road users can help anticipate their actions and interactions with the ego-vehicle. By considering the intentions and movements of surrounding entities, the model can better predict driver gaze in response to potential hazards, interactions, and navigation decisions in urban traffic scenarios.