This paper introduces TIPS, a novel image-text model that leverages synthetic image captions and self-supervised learning techniques to achieve state-of-the-art performance on both dense and global vision tasks.
This paper introduces V2M, a novel image representation learning framework that leverages a 2-dimensional State Space Model (SSM) to capture local spatial dependencies within images, yielding improved performance on image classification and downstream vision tasks compared to existing methods that rely on 1D SSMs.
This paper introduces MOFI, a new vision foundation model that learns image representations effectively from noisy, entity-annotated images.