Multimodal Task Alignment (MTA): Enhancing Bird's-Eye View Perception and Captioning for Autonomous Driving by Aligning Visual and Language Modalities
Aligning visual and language modalities in autonomous driving systems significantly improves both the accuracy of 3D perception tasks and the quality of generated captions, as demonstrated by the MTA framework.
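The summary above does not specify MTA's alignment objective, but cross-modal alignment of this kind is commonly trained with a symmetric contrastive (InfoNCE-style) loss that pulls each visual embedding toward its paired caption embedding and pushes it away from the others. The sketch below is an illustrative assumption, not MTA's actual formulation; the function name, temperature value, and NumPy-only setup are all hypothetical.

```python
import numpy as np

def contrastive_alignment_loss(visual, language, temperature=0.07):
    """Symmetric InfoNCE-style loss between paired visual and language
    embeddings of shape (batch, dim). Illustrative sketch only; not the
    actual MTA objective."""
    # L2-normalize so the dot product is cosine similarity
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = language / np.linalg.norm(language, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (batch, batch) similarity matrix

    labels = np.arange(len(v))  # i-th image is paired with i-th caption

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under this kind of objective, perfectly aligned pairs drive the loss toward zero, while unrelated embeddings leave it near the log of the batch size.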