The paper proposes a two-stage method called Task Decomposition (TaDe) to address the challenge of monocular bird's eye view (BEV) semantic segmentation.
In the first stage, a BEV autoencoder is trained to reconstruct BEV segmentation maps from corrupted latent representations. This forces the decoder to learn the structural priors of typical BEV scenes, so that it produces coherent and plausible BEV maps.
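As a rough illustration, the first stage could look like the following PyTorch sketch; the architecture, channel sizes, noise level, and class count here are placeholders for illustration, not the paper's exact design.

```python
# Minimal sketch of a stage-one denoising BEV autoencoder (assumed details:
# layer widths, Gaussian latent corruption, and 14 semantic classes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVAutoencoder(nn.Module):
    def __init__(self, num_classes=14, latent_dim=256):
        super().__init__()
        # Encoder compresses a one-hot BEV map into a compact latent feature.
        self.encoder = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_dim, 3, stride=2, padding=1),
        )
        # Decoder reconstructs the BEV segmentation map from the latent.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, bev_onehot, noise_std=0.5):
        z = self.encoder(bev_onehot)
        # Corrupt the latent so the decoder must rely on learned BEV priors
        # rather than copying the input.
        z_noisy = z + noise_std * torch.randn_like(z)
        return self.decoder(z_noisy)

# One training step: reconstruct the clean map from the corrupted latent.
model = BEVAutoencoder()
bev_labels = torch.randint(0, 14, (2, 200, 200))                 # dummy GT maps
bev_onehot = F.one_hot(bev_labels, 14).permute(0, 3, 1, 2).float()
logits = model(bev_onehot)
loss = F.cross_entropy(logits, bev_labels)
loss.backward()
```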
In the second stage, an RGB-BEV alignment module is trained to map the input RGB image into the BEV latent space learned in the first stage. At inference, the pre-trained BEV decoder can then be applied directly to predict the BEV segmentation map, without depth estimation or multi-scale features.
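A minimal sketch of the second stage follows, reusing the BEVAutoencoder class from the previous snippet; the ResNet-18 backbone, the MSE latent loss, and the interpolation step are assumptions for illustration and may differ from the paper's alignment module and objective.

```python
# Sketch of stage two: align RGB features with the frozen BEV latent space,
# then decode with the frozen stage-one decoder at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class RGBToBEVLatent(nn.Module):
    """Projects an RGB image into the BEV latent space learned in stage one."""
    def __init__(self, latent_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # 512-ch maps
        self.project = nn.Conv2d(512, latent_dim, kernel_size=1)

    def forward(self, rgb):
        return self.project(self.features(rgb))

# Frozen stage-one model (BEVAutoencoder class from the previous sketch).
bev_autoencoder = BEVAutoencoder()
for p in bev_autoencoder.parameters():
    p.requires_grad_(False)

align = RGBToBEVLatent()
rgb = torch.randn(2, 3, 224, 448)                       # dummy front-view images
bev_labels = torch.randint(0, 14, (2, 200, 200))        # dummy ground-truth maps
bev_onehot = F.one_hot(bev_labels, 14).permute(0, 3, 1, 2).float()

# Training target: the latent that the stage-one encoder assigns to the GT map.
with torch.no_grad():
    target_latent = bev_autoencoder.encoder(bev_onehot)

# Regress the RGB-derived latent onto the BEV latent (illustrative MSE loss).
pred_latent = F.interpolate(align(rgb), size=target_latent.shape[-2:],
                            mode="bilinear", align_corners=False)
loss = F.mse_loss(pred_latent, target_latent)
loss.backward()

# Inference: the frozen BEV decoder turns the predicted latent into a BEV map.
with torch.no_grad():
    z = F.interpolate(align(rgb), size=target_latent.shape[-2:],
                      mode="bilinear", align_corners=False)
    bev_logits = bev_autoencoder.decoder(z)
```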
Additionally, the authors propose transforming the BEV segmentation maps from Cartesian to polar coordinates to establish better column-wise correspondence between the RGB images and BEV maps. This helps the alignment module learn the view transformation more effectively.
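The Cartesian-to-polar resampling could be implemented along the lines below; the ego placement (bottom-center, forward pointing up), field of view, and grid resolutions are assumed values, not the paper's settings. Each output column then gathers all cells along one azimuth ray, which roughly mirrors one column of the perspective image.

```python
# A possible Cartesian-to-polar resampling of a BEV map via grid_sample.
import torch
import torch.nn.functional as F

def bev_cartesian_to_polar(bev, num_rays=200, num_bins=200, fov_deg=90.0):
    """bev: (B, C, H, W) map with the ego at the bottom-center pixel.
    Returns a (B, C, num_bins, num_rays) polar map whose columns are azimuth
    rays, giving column-wise correspondence with the camera image."""
    B, C, H, W = bev.shape
    # Ranges (in pixels) and azimuths for every polar output cell.
    r = torch.linspace(0, H - 1, num_bins)
    theta = torch.deg2rad(torch.linspace(-fov_deg / 2, fov_deg / 2, num_rays))
    r, theta = torch.meshgrid(r, theta, indexing="ij")
    # Cartesian pixel coordinates of each (range, azimuth) sample.
    x = (W - 1) / 2 + r * torch.sin(theta)      # lateral offset from center
    y = (H - 1) - r * torch.cos(theta)          # forward = towards row 0
    # Normalize to [-1, 1] and resample (nearest keeps label values intact).
    grid = torch.stack([x / (W - 1) * 2 - 1, y / (H - 1) * 2 - 1], dim=-1)
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(bev, grid, mode="nearest", align_corners=True)

polar = bev_cartesian_to_polar(torch.randn(1, 14, 200, 200))
print(polar.shape)  # torch.Size([1, 14, 200, 200])
```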
Extensive experiments on the nuScenes and Argoverse datasets demonstrate the superiority of the proposed TaDe method over previous end-to-end approaches, in terms of both accuracy and computational efficiency.