This paper introduces TIPS, a novel image-text model that leverages synthetic image captions and self-supervised learning techniques to achieve state-of-the-art performance on both dense and global vision tasks.
This paper introduces V2M, a novel image representation learning framework that leverages a 2-dimensional State Space Model (SSM) to capture local spatial dependencies within images, yielding improved performance on image classification and downstream vision tasks compared to existing methods that rely on 1D SSMs.
This paper introduces MOFI, a new vision foundation model that learns image representations effectively from noisy, entity-annotated images.