Comprehensive Survey of Vision-Language Models: Advancements, Capabilities, and Future Directions
This survey reviews key advancements in Vision-Language Models (VLMs), categorizing them into three groups based on their input processing and output generation capabilities: Vision-Language Understanding Models, Multimodal Input Text Generation Models, and Multimodal Input-Multimodal Output Models. It analyzes the foundational architectures, training data sources, strengths, and limitations of various VLMs, and highlights potential avenues for future research in this rapidly evolving field.