Core Concepts
Stacking Joint Embedding Architectures (JEAs) hierarchically enables self-supervised learning of separable, interpretable visual representations that capture hierarchical semantic concepts, improving performance on downstream tasks.
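The stacking idea can be illustrated with a minimal numpy sketch (the toy linear encoders and all names here are hypothetical stand-ins; S-JEA itself stacks ResNet-18 encoders trained with a VICReg-style objective at each level): a stack-0 encoder maps the input image to first-level representations, and a stack-1 encoder consumes those representations to produce higher-level ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, out_dim):
    """Toy stand-in for a ResNet-18 encoder: a random linear map + ReLU."""
    W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: np.maximum(x @ W, 0.0)

# Two stacked encoders: stack 0 sees the (flattened) image,
# stack 1 sees stack 0's representations.
enc0 = make_encoder(3072, 512)       # e.g. a 32x32x3 CIFAR-10-like input
enc1 = make_encoder(512, 512)

x = rng.standard_normal((8, 3072))   # a batch of 8 flattened "images"
h0 = enc0(x)                         # stack-0 representations (lower-level)
h1 = enc1(h0)                        # stack-1 representations (higher-level)

print(h0.shape, h1.shape)
```

In the actual method, each stack level would receive representations of two augmented views and be trained with its own joint-embedding loss, so that each level can capture semantics at a different granularity.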
Stats
S-JEA with two stacked ResNet-18 encoders achieves 81.6% top-1 accuracy on CIFAR-10, outperforming the VICReg baseline (ResNet-18) at 80.5%.
On STL-10, S-JEA (stack 0) achieves 76.5% top-1 accuracy, surpassing the VICReg baseline (ResNet-18) at 75.9%.
S-JEA with two ResNet-18 encoders has a parameter count (23.2 million) comparable to that of a single, deeper ResNet-50 (23 million).
Under linear evaluation on CIFAR-10, S-JEA performs comparably to VICReg with a ResNet-50 backbone, indicating that its gains are not solely due to increased parameter count.