Efficient omni-modal representation learning for diverse modalities with VIT-LENS, built on a pretrained ViT.
VIT-LENS makes omni-modal representation learning efficient by leveraging pretrained ViT models, which in turn enables emergent downstream capabilities.