The paper proposes a novel Temporal BEV (TempBEV) encoder that effectively combines temporal aggregation in both image and BEV latent spaces. The authors first provide a comprehensive survey of existing temporal aggregation mechanisms used in learned BEV encoders. They then conduct a comparative study to evaluate the effectiveness of different temporal aggregation operators, including attention, convolution, and max pooling.
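To make the compared operators concrete, here is a minimal NumPy sketch of the three temporal aggregation operators named above (attention, convolution, max pooling) applied to a stack of per-frame feature maps. This is an illustration of the operator families, not the paper's implementation; the function names, shapes, and the fixed convolution kernel are assumptions for the example.

```python
import numpy as np

def temporal_max_pool(feats):
    """Max pooling over the time axis: keep the strongest activation per cell."""
    return feats.max(axis=0)

def temporal_attention(feats, query):
    """Soft attention over time: weight each frame by similarity to a query map."""
    # scores: one scalar relevance per frame (dot product with the query, scaled)
    scores = np.einsum('thwc,hwc->t', feats, query) / np.sqrt(feats.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # weighted sum of frames
    return np.einsum('t,thwc->hwc', weights, feats)

def temporal_conv(feats, kernel):
    """Convolution over the time axis with a (here: fixed, normally learned) kernel."""
    return np.einsum('t,thwc->hwc', kernel, feats)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8, 16))  # (time, H, W, channels)
pooled = temporal_max_pool(feats)
attended = temporal_attention(feats, feats[-1])   # query with the newest frame
convolved = temporal_conv(feats, np.array([0.1, 0.2, 0.3, 0.4]))
```

All three reduce a `(T, H, W, C)` stack to a single `(H, W, C)` map; they differ in whether the per-frame weights are fixed (pooling), learned but time-indexed (convolution), or content-dependent (attention).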
The key insights from the survey and comparative study are:
- Temporal aggregation in the image and BEV latent spaces has complementary strengths: image-space aggregation captures short-term motion cues, while BEV-space aggregation can exploit ego-motion compensation to accumulate information over longer time horizons.
- Most existing approaches aggregate temporally in either image or BEV space alone, missing the potential synergy of combining both.
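The ego-motion compensation mentioned above can be sketched as warping a previous BEV feature map into the current ego frame before fusing it. Below is a minimal NumPy sketch under assumed conventions (a 2D rigid ego-motion, nearest-neighbour sampling, grid origin at the centre); the real encoders do this with learned, differentiable sampling.

```python
import numpy as np

def warp_bev(prev_bev, yaw, tx, ty, cell_size=0.5):
    """Warp a previous BEV feature map into the current ego frame.

    For each cell of the current grid, compute where that point lay in the
    previous frame (inverse rigid transform, sign conventions illustrative)
    and sample with nearest neighbour. Cells falling outside the previous
    grid are zero-filled.
    """
    H, W, C = prev_bev.shape
    # metric coordinates of current-grid cell centres, origin at grid centre
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    x_m = (xs - W / 2) * cell_size
    y_m = (ys - H / 2) * cell_size
    # inverse rigid transform: undo translation, then rotate by -yaw
    c, s = np.cos(-yaw), np.sin(-yaw)
    x_prev = c * (x_m - tx) - s * (y_m - ty)
    y_prev = s * (x_m - tx) + c * (y_m - ty)
    # back to grid indices, nearest neighbour
    j = np.round(x_prev / cell_size + W / 2).astype(int)
    i = np.round(y_prev / cell_size + H / 2).astype(int)
    valid = (i >= 0) & (i < H) & (j >= 0) & (j < W)
    out = np.zeros_like(prev_bev)
    out[valid] = prev_bev[i[valid], j[valid]]
    return out

rng = np.random.default_rng(0)
prev = rng.standard_normal((8, 8, 4))
warped = warp_bev(prev, yaw=0.0, tx=0.5, ty=0.0)  # ego moved one cell along x
```

Because the warp aligns static scene content across frames, the subsequent temporal fusion only has to model genuinely moving objects, which is what makes long-horizon BEV aggregation feasible.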
Based on these findings, the authors develop the TempBEV model, which integrates temporal aggregation in both the image and BEV latent spaces rather than committing to one of them.
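The data flow of such a dual-space design can be sketched as follows. This is an illustrative composition only, not TempBEV's architecture: each stage is a stand-in function (temporal mean, random projection, linear blend) where the real model uses learned networks, and all names and shapes are assumptions.

```python
import numpy as np

def image_space_aggregate(img_feats):
    """Fuse a short window of image features (stand-in: temporal mean)."""
    return img_feats.mean(axis=0)

def lift_to_bev(img_feat, bev_shape):
    """Stand-in for the camera-to-BEV view transform (here: a fixed random projection)."""
    rng = np.random.default_rng(42)
    proj = rng.standard_normal((img_feat.size, int(np.prod(bev_shape))))
    proj /= np.sqrt(img_feat.size)
    return (img_feat.ravel() @ proj).reshape(bev_shape)

def bev_space_aggregate(bev_now, bev_prev_warped, alpha=0.7):
    """Fuse current BEV with ego-motion-compensated history (stand-in: blend)."""
    return alpha * bev_now + (1 - alpha) * bev_prev_warped

rng = np.random.default_rng(0)
img_window = rng.standard_normal((3, 4, 4, 8))  # (time, h, w, c) image features
bev_prev = rng.standard_normal((8, 8, 16))      # previous BEV state, already warped
fused_img = image_space_aggregate(img_window)   # short-term motion cues
bev_now = lift_to_bev(fused_img, (8, 8, 16))
bev_out = bev_space_aggregate(bev_now, bev_prev)  # long-horizon history
```

The point of the sketch is the ordering: short-horizon fusion happens before the view transform, long-horizon fusion after it, so each space contributes the strength identified in the survey.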
Experiments on the nuScenes dataset show that TempBEV significantly outperforms the BEVFormer baseline on both 3D object detection and BEV segmentation. An ablation study further confirms a strong synergy between temporal aggregation in the image and BEV spaces.
Key insights distilled from the paper by Thomas Monni... at arxiv.org, 04-19-2024: https://arxiv.org/pdf/2404.11803.pdf