The paper presents the EMC2 algorithm for optimizing the global contrastive loss in contrastive learning. The key highlights are:
EMC2 utilizes an adaptive Metropolis-Hastings (M-H) subroutine to generate hardness-aware negative samples in an online fashion during the optimization. This avoids the need to compute the partition function in the softmax distribution, which is computationally expensive.
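As a rough illustration of how an M-H step can target the softmax distribution over negatives without its partition function, the following minimal Python sketch accepts or rejects a proposed negative using only two inner products. The function and variable names are hypothetical, the proposal here is uniform over the dataset, and EMC2's actual subroutine (e.g., how chains are maintained across iterations and how the proposal is adapted) may differ.

```python
import numpy as np

def mh_negative_step(anchor_emb, embeddings, current_idx, temperature=0.5, rng=None):
    """One Metropolis-Hastings step targeting p(j) ~ exp(<z_anchor, z_j> / temperature).

    Hypothetical sketch: proposes a uniformly random candidate negative and
    accepts/rejects it using only two inner products, so the normalizing
    constant (partition function) over the whole dataset is never computed.
    """
    rng = rng or np.random.default_rng()
    n = embeddings.shape[0]

    # Symmetric (uniform) proposal over candidate negatives.
    proposal_idx = rng.integers(n)

    # Unnormalized log-densities of the current and proposed negatives.
    log_p_current = anchor_emb @ embeddings[current_idx] / temperature
    log_p_proposal = anchor_emb @ embeddings[proposal_idx] / temperature

    # M-H acceptance ratio; the unknown normalizing constant cancels out.
    accept_prob = min(1.0, np.exp(log_p_proposal - log_p_current))
    if rng.random() < accept_prob:
        return proposal_idx  # chain moves to the (typically harder) accepted negative
    return current_idx       # chain stays at its current state
```

In an online setting, one such chain per anchor can be advanced by a few steps at each optimization iteration, and the current state of the chain is used as the negative sample in the gradient estimate.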
The authors prove that EMC2 finds an O(1/√T)-stationary point of the global contrastive loss in T iterations. This global convergence guarantee holds regardless of the choice of batch size, in contrast to prior works.
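Read as the standard non-convex stationarity metric (our notation, not necessarily the paper's exact statement), the guarantee amounts to a bound of the form

$$\min_{1 \le t \le T} \mathbb{E}\big[\|\nabla \mathcal{L}(\theta_t)\|^2\big] = O\!\left(\tfrac{1}{\sqrt{T}}\right),$$

where $\mathcal{L}$ is the global contrastive loss and $\theta_t$ are the encoder parameters at iteration $t$.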
Numerical experiments on pre-training image encoders on the STL-10 and ImageNet-100 datasets show that EMC2 is effective with small-batch training and achieves comparable or better performance than baseline algorithms.
The analysis involves a non-trivial adaptation of the generic result for biased stochastic approximation schemes. The authors show that the state-dependent Markov transition kernel induced by EMC2 is ergodic and Lipschitz continuous with respect to the model parameter θ.
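A minimal sketch of the kind of conditions such biased stochastic approximation analyses typically impose on the state-dependent kernel $P_\theta$ driving the negative-sample chain (our paraphrase, not the paper's exact assumptions):

$$\|P_{\theta}(x,\cdot) - P_{\theta'}(x,\cdot)\|_{\mathrm{TV}} \le L_P \,\|\theta - \theta'\|
\quad\text{and}\quad
\sup_{x}\,\|P_{\theta}^{k}(x,\cdot) - \pi_{\theta}\|_{\mathrm{TV}} \le C\rho^{k},\quad \rho \in (0,1),$$

i.e., the kernel is Lipschitz in $\theta$ and geometrically ergodic toward its stationary distribution $\pi_\theta$, which is what keeps the bias of the stochastic gradient estimates controlled along the optimization trajectory.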