The paper introduces EM-VLM4AD, an efficient and lightweight multi-frame vision-language model designed for visual question answering in autonomous driving applications.
The key highlights are:
EM-VLM4AD uses a custom image embedding network that aggregates embeddings from multiple camera views using gated pooling attention, and a pre-trained T5 language model as the backbone.
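The paper's exact module layout is not reproduced here, but a minimal PyTorch sketch of what gated pooling attention over per-view embeddings could look like is shown below, assuming each camera view has already been encoded into a single embedding vector (the class name, hidden size, and score MLP are illustrative, not the authors' implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPoolingAttention(nn.Module):
    """Illustrative sketch: pools embeddings from multiple camera views
    into one embedding using a learned, softmax-normalized score per view.
    Layer names and dimensions are assumptions, not the paper's exact design."""

    def __init__(self, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.score_mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, view_embeds: torch.Tensor) -> torch.Tensor:
        # view_embeds: (batch, num_views, embed_dim)
        scores = self.score_mlp(view_embeds)         # (batch, num_views, 1)
        weights = F.softmax(scores, dim=1)           # attention over views
        return (weights * view_embeds).sum(dim=1)    # (batch, embed_dim)

# Example: fuse embeddings from 6 surround-view cameras
fused = GatedPoolingAttention(embed_dim=768)(torch.randn(2, 6, 768))
print(fused.shape)  # torch.Size([2, 768])
```

The pooled embedding can then be concatenated with the tokenized question and passed to the T5 backbone.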
Two versions of EM-VLM4AD are explored: one using a T5-Base language model and another using an 8-bit quantized T5-Large model. Both versions outperform the existing DriveLM-Agent baseline on the DriveLM dataset in BLEU-4, METEOR, ROUGE-L, and CIDEr metrics.
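As a rough illustration of the quantized variant, the Hugging Face Transformers snippet below shows one common way to load an 8-bit T5-Large backbone with bitsandbytes. The model name, prompt, and quantization settings are assumptions and may not match the paper's exact setup; running it requires a CUDA GPU and the bitsandbytes package.

```python
from transformers import BitsAndBytesConfig, T5ForConditionalGeneration, T5Tokenizer

# Illustrative sketch only: the paper's training/quantization pipeline may differ.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-large",
    quantization_config=quant_config,  # 8-bit weights via bitsandbytes
    device_map="auto",
)

# Answer a driving-related question from a hypothetical text prompt
inputs = tokenizer(
    "Question: What should the ego vehicle do next?", return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```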
Computational analysis shows that EM-VLM4AD requires at least 10 times less memory and at least 10 times fewer FLOPs than other large language model-based approaches for autonomous driving, making it better suited to real-time deployment.
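For a back-of-the-envelope sense of the memory comparison, the sketch below counts parameters and estimates the fp32 footprint of a T5 backbone. This is an illustrative approximation only, not the paper's measurement protocol, which also reports FLOPs.

```python
import torch
from transformers import T5ForConditionalGeneration

def param_count_millions(model: torch.nn.Module) -> float:
    """Total number of parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

def approx_memory_gb(model: torch.nn.Module, bytes_per_param: int = 4) -> float:
    """Rough memory footprint from parameter count alone (activations excluded)."""
    return sum(p.numel() for p in model.parameters()) * bytes_per_param / 1e9

t5_base = T5ForConditionalGeneration.from_pretrained("t5-base")
print(f"{param_count_millions(t5_base):.0f}M params, "
      f"~{approx_memory_gb(t5_base):.2f} GB in fp32")
```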
Qualitative results demonstrate EM-VLM4AD's ability to accurately answer a variety of questions related to perception, traffic agent behavior, and planning for autonomous driving tasks. However, its generated answers occasionally contain grammatical errors, and it struggles with questions about ego-vehicle behavior prediction.
The authors conclude by discussing plans to evolve EM-VLM4AD into a video-language model and incorporate multimodal retrieval to further enhance its capabilities.
Key insights distilled from the paper by Akshay Gopal... (arxiv.org, 04-01-2024): https://arxiv.org/pdf/2403.19838.pdf