Key Concepts
Weakly Supervised Object Localization (WSOL) methods can train deep learning models for classification and localization using only global class-level labels, without bounding box supervision. However, the lack of bounding box supervision during training makes hyper-parameter search and model selection challenging. This paper proposes a realistic model selection approach for WSOL that leverages pseudo-bounding boxes generated from off-the-shelf pretrained models, without the need for manual annotation.
Summary
The paper addresses the challenge of model selection in Weakly Supervised Object Localization (WSOL), where the lack of bounding box supervision during training makes hyper-parameter search and model selection difficult.
The authors first show that selecting models with manually annotated bounding boxes in the validation set can overestimate localization performance on the test set, because it relies on strong supervision that is not available in real-world applications. Conversely, using only image-class labels for model selection leads to poor localization performance.
To address this, the authors propose a realistic model selection approach that leverages pseudo-bounding boxes generated from off-the-shelf region proposal methods and pretrained models, such as Selective-Search, CLIP, and RPN (Region Proposal Network). These pseudo-bounding boxes are generated without manual intervention, using only image-class labels. The authors show that, despite being less accurate than manual annotations, these pseudo-bounding boxes can be used effectively for model selection in WSOL, achieving performance close to that of models selected using ground truth bounding boxes.
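One way to picture how discriminative proposals might be picked is a pointing-game style check: a proposal counts as a hit when it contains the peak of a class activation map. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The heatmap source, the inclusive pixel box convention, and the names `peak_location` and `pointing_game_hits` are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def peak_location(heatmap: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (x, y) of the strongest activation in a 2D heatmap."""
    best_xy = (0, 0)
    best_val = float("-inf")
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_val = val
                best_xy = (x, y)
    return best_xy


def pointing_game_hits(
    proposals: List[Box], heatmap: Sequence[Sequence[float]]
) -> List[Box]:
    """Keep only the proposals that contain the heatmap peak."""
    px, py = peak_location(heatmap)
    return [
        (x0, y0, x1, y1)
        for (x0, y0, x1, y1) in proposals
        if x0 <= px <= x1 and y0 <= py <= y1
    ]


if __name__ == "__main__":
    # Toy activation map with its peak at x=2, y=1.
    cam = [
        [0.1, 0.2, 0.1, 0.0],
        [0.0, 0.3, 0.9, 0.2],
        [0.1, 0.1, 0.4, 0.1],
    ]
    candidates = [(0, 0, 1, 1), (1, 0, 3, 2), (2, 1, 2, 1)]
    print(pointing_game_hits(candidates, cam))
```

How the surviving proposals are ranked or merged into a single pseudo-box is left out here, since the summary does not specify it.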
The key steps of the proposed approach are:
Generating pseudo-bounding boxes using different pretrained models (Selective-Search, CLIP, RPN) and a pointing game analysis to select the most discriminative boxes.
Using the generated pseudo-bounding boxes in the held-out validation set for model selection, instead of manual annotations.
Running extensive experiments on the CUB-200-2011 and ILSVRC datasets, which show that models selected using the pseudo-bounding boxes achieve performance close to those selected using ground truth bounding boxes, and better than models selected using only image-class labels.
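The model selection step above can be sketched as follows: each candidate checkpoint is scored by how often its predicted box overlaps the pseudo-bounding box on the validation set, and the best-scoring checkpoint is kept. This is a minimal sketch under assumptions rather than the paper's exact protocol. The IoU threshold of 0.5, the one-box-per-image setup, and the names `localization_accuracy` and `select_checkpoint` are illustrative choices.

```python
from typing import Dict, List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in continuous coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(
    predicted: List[Box], reference: List[Box], threshold: float = 0.5
) -> float:
    """Fraction of images whose predicted box reaches the IoU threshold."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("need one predicted box per reference box")
    hits = sum(iou(p, r) >= threshold for p, r in zip(predicted, reference))
    return hits / len(reference)


def select_checkpoint(
    predictions: Dict[str, List[Box]],
    pseudo_boxes: List[Box],
    threshold: float = 0.5,
) -> str:
    """Return the checkpoint name with the best accuracy against pseudo-boxes."""
    return max(
        predictions,
        key=lambda name: localization_accuracy(
            predictions[name], pseudo_boxes, threshold
        ),
    )


if __name__ == "__main__":
    # Hypothetical validation set with two images and two checkpoints.
    pseudo = [(0, 0, 10, 10), (5, 5, 15, 15)]
    preds = {
        "epoch_1": [(0, 0, 4, 4), (5, 5, 15, 15)],
        "epoch_2": [(0, 0, 10, 9), (5, 6, 15, 15)],
    }
    print(select_checkpoint(preds, pseudo))
```

Because the reference boxes here are pseudo-labels, the score is only a proxy for true localization quality. It is used to rank checkpoints, not to report final performance.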
The authors make the generated pseudo-bounding boxes publicly available to help researchers design more realistic WSOL methods.
Statistics
"Using only coarse image-class labels, a deep model can be trained to perform image classification tasks while yielding the spatial image region of the object (localization)."
"Early works may have unintentionally observed the test performance for hyper-parameter selection leading to an overestimation of model performance."
"Annotating 10 images/class for the ILSVRC dataset amounts to 10,000 images. While such cost may be acceptable in some applications with natural images, it can become expensive in domains such as medical imaging that require experts for annotation."
Quotes
"The lack of bounding box (bbox) supervision during training represents a considerable challenge for hyper-parameter search and model selection."
"Our initial empirical analysis shows that the localization performance of a model declines significantly when using only image-class labels for model selection (compared to using bounding-box annotations)."
"This suggests that adding bounding-box labels is preferable for selecting the best model for localization."