Core Concepts
A self-supervised framework called Learning to Rank Patches (LTRP) is proposed to fairly and effectively reduce image redundancy by quantifying the semantic variation between reconstructions with and without each visible patch, and then learning to rank the patches accordingly.
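A minimal PyTorch sketch of this scoring idea follows; the `reconstruct` callable, the `pseudo_scores` helper, and the L2 dissimilarity are illustrative stand-ins for the frozen pre-trained MAE pass and the paper's actual semantic-variation measure, not the authors' implementation.

```python
# Hypothetical sketch of LTRP's pseudo-score step: for each visible patch,
# compare the MAE reconstruction obtained with and without that patch.
# `reconstruct` is a stand-in for a frozen, pre-trained MAE forward pass;
# the real model and the exact dissimilarity measure may differ.
import torch

def pseudo_scores(visible, reconstruct, dissimilarity=None):
    """
    visible:       (N, D) tensor of visible patch embeddings/pixels.
    reconstruct:   callable mapping a subset of visible patches to a full
                   reconstruction tensor (hypothetical MAE wrapper).
    dissimilarity: callable comparing two reconstructions; defaults to L2.
    Returns one pseudo score per visible patch: the semantic variation the
    reconstruction undergoes when that patch is removed.
    """
    if dissimilarity is None:
        dissimilarity = lambda a, b: torch.norm(a - b).item()

    full_recon = reconstruct(visible)                     # all visible patches kept
    scores = []
    for i in range(visible.shape[0]):
        keep = torch.cat([visible[:i], visible[i + 1:]])  # drop patch i
        recon_without_i = reconstruct(keep)
        scores.append(dissimilarity(full_recon, recon_without_i))
    return torch.tensor(scores)

# Toy usage with a dummy "reconstruction" so the sketch runs end to end.
if __name__ == "__main__":
    dummy_recon = lambda patches: patches.mean(dim=0, keepdim=True).repeat(196, 1)
    visible = torch.randn(49, 768)                        # e.g. 25% of 196 patches visible
    print(pseudo_scores(visible, dummy_recon).shape)      # torch.Size([49])
```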
Abstract
The content presents a self-supervised framework called Learning to Rank Patches (LTRP) for unbiased image redundancy reduction.
Key highlights:
Current leading methods for image redundancy reduction rely on supervised signals, which introduces a categorical inductive bias: content aligned with the labeled categories is preserved while content from unlabeled categories is discarded.
LTRP addresses this issue by leveraging a pre-trained masked autoencoder (MAE) model to infer a pseudo score for each visible patch, quantifying the semantic variation between reconstructions with and without that patch.
The pseudo scores are then used as labels to train a ranking model, enabling fair and effective redundancy reduction in a self-supervised manner (a rough training sketch follows these highlights).
Extensive experiments across datasets and tasks show that LTRP outperforms both supervised and other self-supervised methods, because it preserves meaningful semantics without bias toward the learned categories.
LTRP-based solutions also show promising results for efficient vision transformers, achieving notable inference speedup with negligible accuracy degradation.
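As a rough illustration of the ranking step referenced above, the sketch below trains a hypothetical per-patch scoring head against the pseudo scores with a pairwise margin ranking loss and then keeps the top-ranked patches. `PatchScorer`, `pairwise_rank_loss`, and the specific loss and pruning ratio are assumptions for illustration, not the paper's exact ranking model or objective.

```python
# Minimal sketch, assuming pseudo scores have already been computed as above.
# A small scoring head (stand-in for the ranking model) is trained so that its
# per-patch scores preserve the ordering of the pseudo scores; a pairwise
# margin ranking loss is used here purely for illustration.
import torch
import torch.nn as nn

class PatchScorer(nn.Module):
    """Hypothetical per-patch scoring head: one scalar score per patch token."""
    def __init__(self, dim=768):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens):                 # tokens: (B, N, D)
        return self.head(tokens).squeeze(-1)   # scores: (B, N)

def pairwise_rank_loss(pred, target, margin=0.0):
    """Encourage pred to order patch pairs the same way the pseudo scores do."""
    i, j = torch.triu_indices(pred.shape[-1], pred.shape[-1], offset=1)
    sign = torch.sign(target[..., i] - target[..., j])    # pseudo-score ordering
    return nn.functional.margin_ranking_loss(
        pred[..., i].flatten(), pred[..., j].flatten(), sign.flatten(), margin=margin
    )

# Toy training step followed by redundancy reduction: keep the top-ranked patches.
if __name__ == "__main__":
    tokens = torch.randn(2, 49, 768)           # patch tokens from a ViT encoder
    pseudo = torch.rand(2, 49)                 # pseudo scores from the MAE step
    scorer = PatchScorer()
    opt = torch.optim.AdamW(scorer.parameters(), lr=1e-4)

    opt.zero_grad()
    loss = pairwise_rank_loss(scorer(tokens), pseudo)
    loss.backward()
    opt.step()

    # Inference-time pruning: retain, say, the 25 highest-ranked patches per image.
    keep_idx = scorer(tokens).topk(k=25, dim=-1).indices
    pruned = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, 768))
    print(pruned.shape)                        # torch.Size([2, 25, 768])
```

The `topk`-based pruning at the end is likewise only one plausible way ranked patches could feed an efficient vision transformer; the token-reduction strategy evaluated in the paper may differ.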
Stats
The content does not provide specific metrics or figures to support the key claims; the focus is on the proposed self-supervised framework and its evaluation.