MOFI is a vision foundation model designed to learn image representations from noisy entity-annotated images, achieving state-of-the-art performance.