Learning and Leveraging World Models in Visual Representation Learning
This work explores Image World Models (IWM) for self-supervised learning and identifies three factors that determine whether a strong world model is learned: conditioning, transformation complexity, and predictor capacity. It shows that finetuning the predictor of a capable world model can match or surpass encoder finetuning at a fraction of the cost.
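To make the role of conditioning concrete, the following is a minimal toy sketch, not the architecture or training setup used in the study. It assumes a frozen linear "encoder", a single brightness-scaling transformation with parameter a, and a two-parameter predictor that maps the source latent to the latent of the transformed image. All names (encode, transform, train_predictor, eval_error) are hypothetical and introduced only for illustration. The encoder is never updated; only the predictor is trained, which loosely mirrors the idea of adapting the predictor rather than the encoder.

```python
import random


def encode(image):
    # Frozen toy "encoder" (assumption): a fixed linear map from pixels to 2 latents.
    return [sum(image) / len(image), image[0] - image[-1]]


def transform(image, a):
    # Toy transformation (assumption): brightness scaling by factor a.
    return [a * p for p in image]


def sample(rng):
    # Draw a random 4-pixel "image" and a random transformation parameter.
    image = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    a = rng.uniform(0.5, 1.5)
    return image, a


def predict(w, b, z_src, a, conditioned):
    # Predictor: z_hat = (w * a + b) * z_src. Without conditioning, the
    # transformation parameter is hidden from the predictor (input set to 0).
    cond = a if conditioned else 0.0
    scale = w * cond + b
    return [scale * z for z in z_src], cond


def train_predictor(conditioned, steps=5000, lr=0.1, seed=0):
    # Plain SGD on the squared latent prediction error; the encoder stays frozen.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        image, a = sample(rng)
        z_src = encode(image)
        z_tgt = encode(transform(image, a))
        z_hat, cond = predict(w, b, z_src, a, conditioned)
        # Gradient of 0.5 * ||z_hat - z_tgt||^2 with respect to the scale factor.
        g = sum((zh - zt) * zs for zh, zt, zs in zip(z_hat, z_tgt, z_src))
        w -= lr * g * cond
        b -= lr * g
    return w, b


def eval_error(w, b, conditioned, n=500, seed=1):
    # Mean squared latent prediction error on freshly sampled data.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        image, a = sample(rng)
        z_src = encode(image)
        z_tgt = encode(transform(image, a))
        z_hat, _ = predict(w, b, z_src, a, conditioned)
        total += sum((zh - zt) ** 2 for zh, zt in zip(z_hat, z_tgt))
    return total / n


if __name__ == "__main__":
    for conditioned in (True, False):
        w, b = train_predictor(conditioned)
        err = eval_error(w, b, conditioned)
        print(f"conditioned={conditioned}: w={w:.3f}, b={b:.3f}, error={err:.5f}")
```

In this toy setting the conditioned predictor can represent the transformation exactly, while the unconditioned one can only predict an average effect, so its error stays noticeably higher. The sketch covers only the conditioning factor; transformation complexity and predictor capacity are not modeled here.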