Saanum, T., Schulze Buschoff, L. M., Dayan, P., & Schulz, E. (2024). Next state prediction gives rise to entangled, yet compositional representations of objects. Preprint. Under review. arXiv:2410.04940v1 [cs.LG]
This research investigates whether distributed-representation models, which, unlike slotted models, do not assign each object its own latent slot, can develop compositional representations of objects in an unsupervised manner when trained on videos of object interactions. The study aims to understand whether distributed coding schemes, which allow for entangled representations, offer advantages over purely object-centric coding schemes.
The researchers trained two classes of distributed models (auto-encoding and contrastive) on five datasets of dynamically interacting objects. They compared these models to their object-centric counterparts, which explicitly separate object representations into distinct slots. The models were evaluated on their ability to predict object dynamics and the linear separability of their object representations. A novel metric, inspired by Higgins et al. (2016), was used to quantify the accuracy with which object identities could be linearly decoded from the models' latent representations.
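The exact implementation of the separability metric is not given here; a minimal sketch of the underlying idea, fitting a linear probe to decode object identity from latent vectors and reporting its held-out accuracy, might look like the following. The synthetic "latents" stand in for a trained model's representations and are an assumption of this sketch, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for model latents: each object class adds a
# class-specific offset, so identity is (by construction) linearly decodable.
n_per_class, n_classes, dim = 100, 4, 16
means = rng.normal(scale=2.0, size=(n_classes, dim))
X = np.concatenate(
    [m + rng.normal(scale=0.5, size=(n_per_class, dim)) for m in means]
)
y = np.repeat(np.arange(n_classes), n_per_class)

# Shuffle, then split into train / held-out sets.
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
split = int(0.8 * len(X))
Xtr, Xte, ytr, yte = X[:split], X[split:], y[:split], y[split:]

# Linear probe: least-squares regression onto one-hot labels,
# predicting the class by argmax over the probe's outputs.
Y = np.eye(n_classes)[ytr]
W, *_ = np.linalg.lstsq(np.c_[Xtr, np.ones(len(Xtr))], Y, rcond=None)
pred = np.argmax(np.c_[Xte, np.ones(len(Xte))] @ W, axis=1)
accuracy = (pred == yte).mean()
print(f"linear decoding accuracy: {accuracy:.2f}")
```

High held-out accuracy on real latents would indicate that object identity occupies a linearly separable subspace, even if the representation as a whole remains entangled.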
The study provides evidence that unsupervised training on dynamic object data can lead to the emergence of linearly separable object representations even in models without explicit object slots. The authors argue that partially entangled representations, as opposed to completely disentangled ones, might offer advantages for generalization by allowing models to learn shared representations of object transformations.
This research contributes to the understanding of how compositional representations emerge in artificial neural networks and challenges the notion that complete disentanglement is necessary for generalization. The findings have implications for the design of more efficient and generalizable machine learning models, particularly in domains involving object-centric reasoning and prediction.
The study primarily focuses on unsupervised learning with static and dynamic prediction objectives. Future research could explore object separability in self-supervised learning settings and with different model architectures like Vision Transformers. Investigating the scalability of these findings to naturalistic and real-world video datasets is also crucial.