The author argues that the Embodied Language Model (ELM) improves agents' understanding of driving scenes across both space and time through two components: space-aware pre-training and time-aware token selection.
ELM is presented as a comprehensive framework for understanding driving scenes with large spatial and temporal spans, and it is reported to outperform previous state-of-the-art approaches in a range of applications.