The author introduces the PIP-Net framework for accurate pedestrian intention prediction in real-world urban scenarios, leveraging kinematic data and spatial features with a recurrent and temporal attention-based approach.
The study identifies the pedestrian bounding box, ego-vehicle speed, and local context features as critical for predicting pedestrian crossing intentions, with body pose being less significant. The analysis also reveals potential biases introduced by the speed feature and proposes an alternative feature representation to mitigate them.
The work proposes a lightweight and efficient multi-modal fusion framework that leverages positional decoupling, a graph-embedded Transformer, and ego-vehicle motion to accurately predict pedestrian crossing intentions.