Leveraging Vision-Language Models for Generating Diverse and Synchronized Sound Effects for Videos
SonicVisionLM is a novel framework that uses vision-language models (VLMs) to generate a wide range of sound effects for silent videos, with each effect semantically relevant to the on-screen content and temporally synchronized with it.