This paper combines Parallel Dense Video Captioning (PDVC) with CLIP visual features in an end-to-end model to improve dense captioning of traffic safety scenario videos.
CityLLaVA introduces an efficient fine-tuning framework for Visual Language Models (VLMs) that improves their comprehension and prediction accuracy in urban traffic scenarios. The framework comprises bounding box-guided visual prompt engineering, textual prompt construction, and block expansion-based fine-tuning.
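To make the idea of a bounding box-guided visual prompt concrete, the sketch below shows one plausible preprocessing step: enlarging a bounding box about its center so that a crop passed to the VLM keeps some surrounding context. This is an illustration under stated assumptions, not CityLLaVA's actual procedure; the function name `expand_box`, the `(x_min, y_min, x_max, y_max)` pixel convention, and the default `scale` of 1.5 are all hypothetical.

```python
from typing import Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[int, int, int, int]


def expand_box(box: Box, image_size: Tuple[int, int], scale: float = 1.5) -> Box:
    """Enlarge `box` about its center by `scale`, clamped to the image bounds.

    `image_size` is (width, height). The scale value is an illustrative
    assumption, not a setting taken from the paper.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    # Center of the original box.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Half-extents of the enlarged box.
    half_w = (x1 - x0) * scale / 2
    half_h = (y1 - y0) * scale / 2
    # Clamp so the crop never leaves the image.
    return (
        max(0, int(cx - half_w)),
        max(0, int(cy - half_h)),
        min(width, int(cx + half_w)),
        min(height, int(cy + half_h)),
    )
```

The returned box could then be used to crop the frame, or to draw a rectangle on it, before the image is handed to the model; which of these the original framework uses is not specified in the summary above.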