Core Concepts
Unsupervised object discovery can be improved by selectively masking background regions during training and using multi-query slot attention to learn more stable and generalizable object representations.
Abstract
The paper proposes a method for unsupervised object discovery that combines two key innovations:
Selective Masking: During training, the method masks background regions of the input image rather than randomly chosen ones, forcing the model to focus on learning representations of the salient objects. Random masking, by contrast, can degrade the model's ability to learn meaningful object-centric representations.
Multi-Query Slot Attention: The method extends the standard slot attention mechanism by using multiple independent sets of slot queries. Each set of slots is trained independently, and during inference, the multiple slot representations are combined through Hungarian matching to obtain the final object representations. This multi-query approach helps to stabilize the object discovery process and produce more robust results.
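The slot-combination step described above can be illustrated with a minimal sketch. It assumes each query set yields a (num_slots, dim) array of slot vectors, that all sets are aligned to the first set by Hungarian matching on squared Euclidean distance, and that aligned slots are then averaged. The cost function, the choice of reference set, and the averaging rule are illustrative assumptions that this summary does not confirm, and the function name align_and_merge_slots is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_and_merge_slots(slot_sets):
    """Align every slot set to the first one, then average the aligned slots.

    slot_sets: array-like of shape (num_sets, num_slots, dim).
    Averaging and the squared-distance cost are assumptions for illustration.
    """
    slot_sets = np.asarray(slot_sets, dtype=float)
    reference = slot_sets[0]
    aligned = [reference]
    for slots in slot_sets[1:]:
        # Pairwise squared Euclidean distance: cost[i, j] compares
        # reference slot i with candidate slot j.
        cost = ((reference[:, None, :] - slots[None, :, :]) ** 2).sum(axis=-1)
        # Hungarian matching: one-to-one assignment with minimal total cost.
        row_idx, col_idx = linear_sum_assignment(cost)
        # Reorder the candidate slots so that position i matches reference slot i.
        order = col_idx[np.argsort(row_idx)]
        aligned.append(slots[order])
    # Combine the aligned sets into a single set of slots.
    return np.mean(aligned, axis=0)
```

Matching is needed because slot order is arbitrary within each set: without alignment, averaging position i across sets could mix slots that bind to different objects.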
The authors evaluate their method on the PASCAL-VOC 2012 dataset and show that the combination of selective masking and multi-query slot attention consistently outperforms previous unsupervised object discovery methods in terms of correct localization (CorLoc), mean intersection over union (mIoU), and mean best overlap (mBo). The ablation studies further demonstrate the importance of each component of the proposed approach.
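For readers unfamiliar with the localization metric, the sketch below shows how CorLoc is commonly computed: an image counts as correctly localized when the predicted box overlaps at least one ground-truth box with IoU above 0.5. The box format (x_min, y_min, x_max, y_max), the one-prediction-per-image setup, and the function names are assumptions for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box matches any ground-truth box.

    predictions: one predicted box per image.
    ground_truths: one list of ground-truth boxes per image.
    """
    hits = sum(
        any(box_iou(pred, gt) > threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return hits / len(predictions)
```

mIoU and mBO are built on the same overlap measure, but are typically computed on segmentation masks rather than boxes.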
Stats
Masking 70% of the background patches during training improves object localization performance compared to random masking.
Using 8 independent sets of slot queries and combining them through Hungarian matching leads to better results than using a single set of slots.
Increasing the number of slot query heads from 1 to 8 improves performance and reduces the standard deviation of the results.
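The 70% background masking ratio reported above can be sketched as follows. The sketch assumes a boolean per-patch map marking which patches are background is already available; how the method obtains that map is not stated in this summary, and the function name select_background_mask is hypothetical. Foreground patches are never masked here.

```python
import numpy as np


def select_background_mask(is_background, mask_ratio=0.7, rng=None):
    """Return a boolean array marking which patches to mask.

    is_background: boolean array, True where a patch is background
                   (assumed to be given; its source is not specified here).
    mask_ratio: fraction of background patches to mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    is_background = np.asarray(is_background, dtype=bool)
    # Flat indices of all background patches.
    background_idx = np.flatnonzero(is_background)
    num_to_mask = int(round(mask_ratio * background_idx.size))
    # Pick a random subset of background patches without replacement.
    chosen = rng.permutation(background_idx)[:num_to_mask]
    masked = np.zeros(is_background.shape, dtype=bool)
    masked.flat[chosen] = True
    return masked
```

For example, with 10 background patches and 6 foreground patches, this marks 7 background patches for masking and leaves every foreground patch visible.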
Quotes
"Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization."