Core Concepts
This paper proposes UnCounTR, a model that learns to count objects in images without requiring any manual annotations. The key idea is to construct "Self-Collages": training images composed of various pasted objects, which provide a rich learning signal covering arbitrary object types and counts.
Abstract
The paper focuses on the objective of training a reference-based counting model without any manual annotation. The key contributions are:
A simple yet effective data generation method that constructs 'Self-Collages' by pasting object crops onto a background image, obtaining supervision signals for free.
Leveraging self-supervised pretrained visual features from DINO and developing UnCounTR, a transformer-based model architecture for counting.
Experiments showing that the proposed method, trained without manual annotations, not only outperforms baselines and generic models like Faster R-CNN and DETR, but also matches the performance of supervised counting models in some domains.
The paper first introduces the Self-Collage data generation method, where objects from ImageNet-1k are pasted onto background images from SUN397 to create training samples with pseudo-labels. Then, it presents the UnCounTR model architecture, which uses a frozen DINO visual encoder and a transformer-based decoder to predict the object density map.
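The Self-Collage idea can be sketched in a few lines: paste object crops onto a background at random positions, and the count and paste locations become free pseudo-labels, with a Gaussian density map as the regression target. This is a minimal NumPy sketch under assumed conventions, not the paper's actual implementation; all function names and the naive (blend-free) pasting are illustrative.

```python
import numpy as np

def make_self_collage(background, objects, rng):
    """Paste object crops onto a background at random positions.

    Returns the collage, the paste centres, and the pseudo-label
    (the number of pasted objects): supervision comes for free.
    Illustrative sketch only, not the paper's API.
    """
    canvas = background.copy()
    centres = []
    h, w = canvas.shape[:2]
    for obj in objects:
        oh, ow = obj.shape[:2]
        y = rng.integers(0, h - oh + 1)
        x = rng.integers(0, w - ow + 1)
        canvas[y:y + oh, x:x + ow] = obj  # naive paste, no blending
        centres.append((y + oh // 2, x + ow // 2))
    return canvas, centres, len(objects)

def density_map(shape, centres, sigma=2.0):
    """Pseudo ground-truth density map: one Gaussian blob per object
    centre, each normalised so the whole map sums to the object count."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    dm = np.zeros(shape, dtype=np.float64)
    for cy, cx in centres:
        blob = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        dm += blob / blob.sum()
    return dm
```

Because each blob integrates to one, summing the predicted density map at inference time directly yields the estimated count, which is the standard readout for density-based counters.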
The experiments evaluate UnCounTR on the FSC-147, MSO, and CARPK datasets. It outperforms baselines like connected components and object detectors on most metrics, and even matches the performance of the supervised CounTR model on the low and medium count ranges of FSC-147. Further improvements to UnCounTR, such as using DINOv2 and refining high-count predictions, yield the final UnCounTRv2 model with even stronger performance.
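Counting benchmarks such as FSC-147 are typically scored with mean absolute error and root mean squared error between predicted and annotated counts; a minimal sketch of these metrics (not taken from the paper's code) is:

```python
import numpy as np

def count_errors(pred, gt):
    """Mean absolute error (MAE) and root mean squared error (RMSE)
    between predicted and annotated object counts."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    err = pred - gt
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```

Splitting the evaluation set by ground-truth count before applying these metrics gives the low/medium/high count-range comparison described above.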
Finally, the paper demonstrates the potential of UnCounTR for self-supervised semantic counting, where the model can identify and count different object categories in an image without any annotations.
Stats
This section extracts no specific numbers. The key figure is a pair of log-log scatter plots (annotated count vs. prediction, both axes spanning 10^1 to 10^3) comparing the supervised topline and the unsupervised UnCounTR ("Ours") against the ground-truth diagonal.
Quotes
"While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets, they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images."
"We propose UnCounTR, a model that can learn this task without requiring any manual annotations."