The paper presents a novel method called SAMP (Simplified Slot Attention with Max Pool Priors) for learning object-centric representations from images in a simple, scalable, and non-iterative manner.
The key components of SAMP are:
Encoder: A standard CNN-based image encoder that preserves the spatial dimensions of the input.
Grouping Module: A non-iterative module that forms slots from the encoded features: max-pooling layers provide the slot initializations (the max-pool priors), and a Simplified Slot Attention (SSA) layer binds features to slots in a single pass.
Decoder: A slot-wise spatial broadcast decoder that reconstructs the input image from the learned slots.
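The grouping step above can be sketched in a few lines of numpy. This is an illustrative reading, not the paper's exact architecture: the strip-wise pooling scheme and the single dot-product attention step are assumptions standing in for the max-pool priors and the SSA layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def max_pool_priors(features, num_slots):
    # features: (H, W, D). Split the spatial grid into num_slots chunks
    # and max-pool each chunk into one slot prior. (Illustrative choice;
    # the paper's exact pooling scheme may differ.)
    H, W, D = features.shape
    flat = features.reshape(H * W, D)
    chunks = np.array_split(flat, num_slots, axis=0)
    return np.stack([c.max(axis=0) for c in chunks])     # (K, D)

def single_pass_grouping(features, slots):
    # One attention step, no iterative refinement: features compete for
    # slots via a softmax over the slot axis, then each slot is updated
    # as the attention-weighted mean of the features assigned to it.
    flat = features.reshape(-1, features.shape[-1])      # (N, D)
    logits = flat @ slots.T / np.sqrt(slots.shape[-1])   # (N, K)
    attn = softmax(logits, axis=1)                       # compete over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ flat                              # (K, D)
```

Usage: `slots = max_pool_priors(features, 4)` followed by a single call to `single_pass_grouping(features, slots)` replaces the iterative refinement loop of standard Slot Attention.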
Competition and specialization in SAMP are induced through the max-pooling layers, the SSA Layer, and the slot-wise reconstruction in the decoder. This allows SAMP to learn meaningful object-centric representations without iterative refinement, unlike previous methods.
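The slot-wise reconstruction can be illustrated with a minimal spatial broadcast decoder: each slot is tiled over the image grid with coordinate channels, decoded independently, and the per-slot outputs are combined by alpha-normalized mixing. The linear decoder weights here are a hypothetical stand-in for a trained CNN decoder.

```python
import numpy as np

def spatial_broadcast(slot, height, width):
    # Tile one slot vector over an H x W grid and append 2-D coordinate
    # channels so a decoder can recover spatial position.
    grid = np.tile(slot, (height, width, 1))             # (H, W, D)
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    coords = np.stack([ys, xs], axis=-1)                 # (H, W, 2)
    return np.concatenate([grid, coords], axis=-1)       # (H, W, D + 2)

def decode_slots(slots, height, width):
    # Decode each slot independently (slot-wise), then mix the per-slot
    # RGBA outputs with a softmax over the alpha channel. The linear map
    # W_dec is a placeholder for the paper's CNN decoder.
    K, D = slots.shape
    rng = np.random.default_rng(0)
    W_dec = rng.normal(scale=0.1, size=(D + 2, 4))       # hypothetical weights
    outs = np.stack([spatial_broadcast(s, height, width) @ W_dec
                     for s in slots])                    # (K, H, W, 4)
    rgb, alpha = outs[..., :3], outs[..., 3:]
    mix = np.exp(alpha) / np.exp(alpha).sum(axis=0, keepdims=True)
    return (mix * rgb).sum(axis=0)                       # (H, W, 3)
```

Because each slot must explain its own region of the image through its alpha mask, the slots are pushed to specialize on distinct objects.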
SAMP is evaluated on standard object-centric learning benchmarks (CLEVR6, Multi-dSprites, Tetrominoes) and is shown to be competitive with or to outperform previous methods, while being simpler and more scalable.
The paper also analyzes the learned representations, visualizing the attention of the slots over the input features. It finds that the slots specialize to capture different parts of the objects, without the need for explicit iterative refinement.
Key insights extracted from the source at arxiv.org, by Vihang Patil... 10-02-2024.
https://arxiv.org/pdf/2410.00728.pdf