The author argues that by using prediction as the main learning objective, a novel network architecture called OPPLE can simultaneously learn object segmentation, depth perception, and 3D object localization without supervision. This approach is inspired by how infants develop perceptual abilities.
Objects are learned through prediction, mirroring how human infants acquire object perception.
SAMP (Simplified Slot Attention with Max Pool Priors) is a simple, scalable, non-iterative method that learns object-centric representations from images by inducing competition and specialization among slots.
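To make the slot mechanism concrete, here is a minimal sketch of the general idea (my own illustrative code, not the authors' implementation): slots are seeded by max-pooling over chunks of image features (the "max pool prior"), and a single softmax over slots makes them compete for each feature location, without iterative refinement.

```python
import numpy as np

def samp_slots(features, num_slots):
    """Illustrative, non-iterative slot computation (assumed, not SAMP's
    actual code): seed slots via max pooling, then one competition step."""
    n, d = features.shape
    # Seed each slot with the max-pool over an equal chunk of locations,
    # giving slots distinct starting points (the "max pool prior").
    chunks = np.array_split(features, num_slots)
    slots = np.stack([c.max(axis=0) for c in chunks])      # (num_slots, d)
    # One softmax over slots per location: slots compete for each input.
    logits = features @ slots.T / np.sqrt(d)               # (n, num_slots)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # Each slot becomes the attention-weighted mean of its locations,
    # so slots specialize on the regions they won.
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ features                            # (num_slots, d)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # 16 feature locations, 8 channels
slots = samp_slots(feats, num_slots=4)
print(slots.shape)  # (4, 8)
```

A real implementation would use learned projections for the attention and train end-to-end; the sketch only shows why a single competition step can already specialize slots.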
Training models to predict future states in dynamic environments can lead to the emergence of linearly separable object representations, even in models without explicit object-centric architectural priors. This suggests that partially entangled representations can be beneficial for generalization.
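"Linearly separable" here is typically tested with a linear probe: fit a linear classifier on frozen representations and check whether object identity is decodable. The following toy sketch (synthetic features, not the paper's data) shows the probing procedure with a least-squares linear probe.

```python
import numpy as np

# Hypothetical illustration: test whether "object identity" is linearly
# decodable from representations by fitting a least-squares linear probe.
rng = np.random.default_rng(1)
n_per_class, dim = 50, 16
# Two synthetic "object classes" whose features differ by a mean offset,
# mimicking representations that are entangled yet linearly separable.
class_a = rng.normal(loc=0.0, size=(n_per_class, dim))
class_b = rng.normal(loc=1.5, size=(n_per_class, dim))
X = np.vstack([class_a, class_b])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Linear probe: solve least squares on [X | 1] against the labels,
# then threshold the probe's output at 0.5 to classify.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
pred = (Xb @ w > 0.5).astype(int)
accuracy = (pred == y).mean()
print(f"linear probe accuracy: {accuracy:.2f}")
```

High probe accuracy on frozen features is the usual operationalization of "linearly separable object representations"; in practice one would probe the trained predictive model's activations rather than synthetic vectors.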