An efficient pipeline, RED-VILLM, is proposed to quickly develop high-performing Video Large Language Models (Video LLMs) by building on existing Image Large Language Models (Image LLMs) and adding a plug-and-play temporal adaptation module.
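The paper does not spell out the adapter's internals here, but the core idea of a plug-and-play temporal module can be sketched as follows: compress per-frame features from the Image LLM's frozen vision encoder along the time axis into an image-like token grid, so the rest of the Image LLM pipeline (projector and language model) runs unchanged. The function name, pooling scheme, and shapes below are illustrative assumptions, not the method's actual design.

```python
import numpy as np

def temporal_adapter(frame_features: np.ndarray, pooled_frames: int = 4) -> np.ndarray:
    """Hypothetical plug-and-play temporal adaptation module.

    frame_features: (T, N, D) patch features, one row of N tokens per video frame,
    produced by the Image LLM's (frozen) vision encoder.
    Returns (pooled_frames * N, D): frames are grouped into `pooled_frames`
    temporal bins and averaged within each bin, yielding a fixed-size token
    sequence the downstream Image LLM can consume as-is.
    """
    T, N, D = frame_features.shape
    # Split the T frame indices into contiguous temporal bins.
    bins = np.array_split(np.arange(T), pooled_frames)
    # Average the frame features inside each bin, then concatenate the bins.
    return np.concatenate([frame_features[b].mean(axis=0) for b in bins])
```

Because the adapter only reshapes and pools encoder outputs, it can in principle be bolted onto any Image LLM without retraining the encoder, which matches the "plug-and-play" framing above.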
SlowFast-LLaVA is a training-free Video LLM that captures both detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used language models.
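The way a two-pathway design can stay within a fixed token budget can be sketched as: a slow pathway keeps a few frames at fine spatial resolution (spatial detail), while a fast pathway keeps all frames but pools their tokens aggressively (temporal coverage). The grid sizes and frame counts below are illustrative assumptions, not SlowFast-LLaVA's actual configuration.

```python
import numpy as np

def spatial_pool(frame_tokens: np.ndarray, out_hw: int) -> np.ndarray:
    """Average-pool an (H, W, D) grid of patch tokens down to out_hw x out_hw tokens."""
    H, W, D = frame_tokens.shape
    sh, sw = H // out_hw, W // out_hw
    pooled = frame_tokens[: out_hw * sh, : out_hw * sw]
    pooled = pooled.reshape(out_hw, sh, out_hw, sw, D).mean(axis=(1, 3))
    return pooled.reshape(-1, D)

def slowfast_tokens(video_tokens: np.ndarray,
                    slow_frames: int = 8, slow_hw: int = 12,
                    fast_hw: int = 4) -> np.ndarray:
    """Hypothetical SlowFast-style token budgeting.

    video_tokens: (T, H, W, D) patch features from a frozen image encoder.
    Slow path: `slow_frames` evenly sampled frames, lightly pooled (spatial detail).
    Fast path: all T frames, heavily pooled (long-range temporal context).
    """
    T = video_tokens.shape[0]
    slow_idx = np.linspace(0, T - 1, slow_frames).astype(int)
    slow = np.concatenate([spatial_pool(video_tokens[t], slow_hw) for t in slow_idx])
    fast = np.concatenate([spatial_pool(video_tokens[t], fast_hw) for t in range(T)])
    # Total tokens: slow_frames * slow_hw**2 + T * fast_hw**2.
    return np.concatenate([slow, fast])
```

With the assumed defaults, a 64-frame video with 24x24 patch grids yields 8 x 144 + 64 x 16 = 2176 tokens, comfortably inside a typical 4k-token context, illustrating how the design trades per-frame resolution against frame coverage rather than dropping either.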