This research introduces CECE (Caption Expansion with Contradictions and Entailments), a novel method leveraging Natural Language Inference (NLI) to improve the compositional reasoning capabilities of Vision-Language Models (VLMs).
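The core idea can be illustrated in a few lines. Below is a minimal sketch, not the authors' implementation: in CECE the entailments and contradictions are generated by prompting an LLM with NLI-style instructions, whereas here they are hardcoded for illustration, and the aggregation rule (mean entailment similarity minus mean contradiction similarity) is an assumption.

```python
# Hedged sketch of the CECE idea: expand a caption into entailments and
# contradictions, then score an image against both sets with a CLIP-style model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "a dog sitting on a red couch"
# In CECE these come from an LLM prompted for NLI-style rewrites;
# hardcoded here purely for illustration.
entailments = ["a dog is on a piece of furniture", "an animal rests indoors"]
contradictions = ["a cat sitting on a red couch", "a dog standing on grass"]

def mean_similarity(image: Image.Image, texts: list[str]) -> float:
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (1, num_texts) image-text similarity logits
    return out.logits_per_image.mean().item()

image = Image.open("example.jpg")  # hypothetical input image
# Assumed aggregation: reward agreement with entailments, penalize contradictions.
score = mean_similarity(image, entailments) - mean_similarity(image, contradictions)
print(f"CECE-style compositional score: {score:.2f}")
```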
Open-source Vision-Language Models (VLMs) can match the performance of closed-source models by scaling training with large, high-quality instruction datasets and synthetic data.
BLIP-3-Video achieves competitive video understanding performance by using a novel temporal encoder to represent videos with significantly fewer visual tokens (32 vs. thousands), improving efficiency without sacrificing accuracy.
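One simple way to realize such a fixed token budget is a Perceiver-style attention pooler, sketched below. This is an assumption for illustration, not BLIP-3-Video's exact encoder (the paper explores several temporal designs); the dimensions and shapes are arbitrary.

```python
# Hedged sketch: compress a variable number of video tokens into 32 tokens
# using learnable queries that cross-attend to all frame tokens.
import torch
import torch.nn as nn

class TemporalTokenPooler(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # 32 learnable query vectors, one per output token.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames * tokens_per_frame, dim)
        b = frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)
        return self.norm(pooled)  # (batch, 32, dim), regardless of video length

# 8 frames x 256 patch tokens = 2048 visual tokens in, 32 out.
video_tokens = torch.randn(2, 8 * 256, 768)
print(TemporalTokenPooler()(video_tokens).shape)  # torch.Size([2, 32, 768])
```

The design point is that the downstream language model always sees 32 visual tokens, so its compute cost no longer grows with video length.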
TULIP, a novel method for upgrading CLIP-like vision-language models, handles long captions by incorporating relative positional encodings and a two-step adaptation process, yielding significant improvements in cross-modal retrieval and text-to-image generation.
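To see why relative positional encodings help with long captions, consider rotary embeddings, one common relative scheme that can replace CLIP's fixed 77-token absolute embeddings. The sketch below is illustrative, not TULIP's code; the dimensions and frequency base are assumptions.

```python
# Hedged sketch of rotary (relative) position embeddings applied to
# query/key vectors before computing attention scores.
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, dim) query or key vectors; dim must be even.
    _, seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()  # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate channel pairs by position-dependent angles; dot products between
    # rotated queries and keys then depend only on their relative offset,
    # so captions longer than the pretraining length remain well-posed.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = rotary_embed(torch.randn(1, 248, 64))  # a 248-token caption, beyond CLIP's 77
k = rotary_embed(torch.randn(1, 248, 64))
scores = q @ k.transpose(-1, -2)  # relative-position-aware attention logits
print(scores.shape)  # torch.Size([1, 248, 248])
```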