Core Concepts
LeGo-Drive is a framework that addresses the infeasibility of coarse control-action estimates produced by Vision-Language Assistants (VLAs) by reframing the task as a short-term goal-reaching problem. It generates a goal aligned with the navigation instruction and improves it by learning the trajectory optimizer's parameters jointly with the network's behavioral inputs, so that the predicted goal is feasible for the downstream planner.
Abstract
The LeGo-Drive framework consists of two main components:
- Goal Prediction Module:
- Takes a front-view image and a corresponding language command as input.
- Generates a segmentation mask and predicts a goal location based on the given instruction.
- Utilizes CLIP image and text encoders, along with a transformer encoder, to capture the multi-modal context.
- Includes two decoder heads for segmentation mask prediction and goal point prediction.
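The data flow through the module (two encoders, a fusion step, and two decoder heads) can be sketched as below. The stand-in encoders, the concatenation-based fusion, and all vector sizes are illustrative assumptions, not the paper's actual architecture; in the real system the encoders are CLIP models and the fusion is a transformer encoder.

```python
# Sketch of the goal-prediction forward pass. Every function here is a
# lightweight stand-in for a learned component (assumption for illustration).

def encode_image(image):
    # Stand-in for the CLIP image encoder: returns a feature vector.
    return [0.1 * p for p in image]

def encode_text(command):
    # Stand-in for the CLIP text encoder: hashes characters into a small vector.
    vec = [0.0] * 4
    for i, ch in enumerate(command):
        vec[i % 4] += ord(ch) / 1000.0
    return vec

def fuse(img_feat, txt_feat):
    # Stand-in for the transformer encoder capturing multi-modal context:
    # here, simple concatenation.
    return img_feat + txt_feat

def segmentation_head(ctx):
    # Decoder head 1: per-feature "mask logits" (illustrative thresholding).
    return [x > 0.5 for x in ctx]

def goal_head(ctx):
    # Decoder head 2: regress a 2D goal point from the fused context.
    n = len(ctx)
    return (sum(ctx[: n // 2]), sum(ctx[n // 2:]))

def predict_goal(image, command):
    ctx = fuse(encode_image(image), encode_text(command))
    return segmentation_head(ctx), goal_head(ctx)
```

The key structural point is that both heads read the same fused image-language context, so the segmentation mask and the goal point are predicted from a shared representation of the instruction and the scene.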
- Differentiable Planner:
- Formulates the trajectory planning as an optimization problem in the Frenet frame.
- Parameterizes the motion along the X-Y directions using a smooth basis function.
- Introduces a conditioning mechanism to accelerate the convergence of the optimizer by using a partial solution.
- Integrates the planner as a differentiable module within the overall architecture, allowing for end-to-end training.
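Parameterizing motion with a smooth basis means the planner optimizes a small set of coefficients rather than raw waypoints. A minimal sketch, assuming a Bernstein-polynomial basis (a common smooth basis for such planners; the paper's exact basis and discretization are not reproduced here):

```python
from math import comb

def bernstein_basis(n, t):
    """Degree-n Bernstein basis functions evaluated at t in [0, 1]."""
    return [comb(n, k) * t**k * (1 - t)**(n - k) for k in range(n + 1)]

def trajectory(ctrl_x, ctrl_y, num_pts=50):
    """Smooth X-Y trajectory (Frenet-frame coordinates) from control points.

    The optimizer would adjust ctrl_x / ctrl_y to minimize its cost; the
    point count and degree here are illustrative assumptions.
    """
    n = len(ctrl_x) - 1
    pts = []
    for i in range(num_pts):
        t = i / (num_pts - 1)
        b = bernstein_basis(n, t)
        pts.append((sum(bi * cx for bi, cx in zip(b, ctrl_x)),
                    sum(bi * cy for bi, cy in zip(b, ctrl_y))))
    return pts
```

Because the trajectory is a linear function of the coefficients, gradients of any trajectory cost pass cleanly back to the coefficients, which is what makes embedding the planner as a differentiable module tractable.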
The key innovation of LeGo-Drive is the end-to-end training approach, where the goal prediction module and the planner module are jointly optimized. This allows the goal prediction to be aware of the downstream planner's capabilities, leading to the prediction of feasible goal positions that can be effectively reached by the planner.
The authors conduct extensive experiments in diverse simulated environments and report significant improvements in standard autonomous driving metrics, with an 81% goal-reaching success rate. They also showcase the versatility of LeGo-Drive across different driving scenarios and linguistic inputs, underscoring its potential for practical deployment in autonomous vehicles and intelligent transportation systems.
Stats
The dataset used in this work, called LeGo-Drive, consists of 4500 training and 1000 validation data points. It includes a variety of driving maneuvers, such as lane changes, speed adjustments, turns, passing or stopping for other objects or vehicles, navigating through intersections, and stopping at crosswalks or traffic signals.
The dataset was collected in the CARLA simulator, with the ego-agent's states, front RGB camera image, and traffic agent states recorded at 10 FPS. The navigation instructions were manually annotated, covering three different command categories: object-centric, lane maneuver, and composite commands.
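A single data point described above could be modeled with a record like the following. The field names and types are assumptions for illustration; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass

@dataclass
class DrivePoint:
    """One LeGo-Drive data point (hypothetical schema)."""
    front_image: str    # path to the front RGB camera frame (recorded at 10 FPS)
    command: str        # manually annotated navigation instruction
    category: str       # "object-centric", "lane maneuver", or "composite"
    goal_region: tuple  # coarse annotated goal location (x, y)
    ego_state: dict     # ego-agent state from the CARLA simulator
```

Keeping the annotation to a coarse goal region plus a command is what makes the 4500/1000 train/validation split cheap to collect compared to full trajectory demonstrations.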
Quotes
"The core challenge in performing language-conditioned goal prediction is that the network should be aware of the vehicle and the scene constraints i.e., predicting a goal that is outside the drivable area is undesirable."
"The advantages of such a goal-oriented approach are three-fold. First, the dataset need to be annotated for matching language commands to just goal positions. Moreover, our results are based on providing the supervision of only a very coarse goal region conditioned on the language command, which is easier to obtain compared to the demonstration of the complete driving trajectory. Second, as discussed in [2, 10, 11], goal-directed planning improves the explainability in autonomous driving. Finally, predicting just the goal position would allow the use of smaller and lightweight networks and consequently less data for training, plus faster inference time."