Efficient learning can be achieved by a tangling-untangling cycle that maps context-independent representations to context-dependent representations in high-dimensional space, and then collapses the context variables back to the original low-dimensional space for generalization.
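A minimal hypothetical sketch of the idea, assuming a tensor-product lift with a one-hot context variable (an illustrative assumption, not the paper's construction): a low-dimensional code is "tangled" with a context into a higher-dimensional code, and "untangled" again by marginalizing the context axis.

```python
# Hypothetical illustration (assumption, not the paper's construction) of a
# tangling-untangling cycle: lift a context-independent code x into a
# context-dependent, higher-dimensional code via a tensor product with a
# one-hot context c, then collapse the context axis to recover x for generalization.
import numpy as np

def tangle(x, context_id, n_contexts):
    c = np.zeros(n_contexts)
    c[context_id] = 1.0
    return np.outer(c, x)          # context-dependent code in an (n_contexts x dim) space

def untangle(z):
    return z.sum(axis=0)           # marginalize out the context variable

x = np.random.randn(8)             # low-dimensional, context-independent representation
z = tangle(x, context_id=3, n_contexts=5)
print(np.allclose(untangle(z), x)) # collapsing the context recovers the original code
```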
This paper investigates the theoretical limits of learnability for out-of-distribution (OOD) detection under the risk and AUC metrics. The authors identify necessary and sufficient conditions for the learnability of OOD detection in several representative domain spaces, revealing both the challenges and the possibilities of successful OOD detection in practice.
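For orientation, the two evaluation metrics in a standard formulation (assumed here for context; the paper's exact notation may differ), with ID labels in {1, ..., K}, OOD data assigned the extra label K+1, a detector f, and a scoring function s:

```latex
% Standard risk and AUC formulations for OOD detection (assumptions for context).
\[
  \mathrm{Risk}_{\mathcal{D}}(f) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}_{XY}}\big[\ell\big(f(x),y\big)\big],
  \qquad y \in \{1,\dots,K,K+1\},
\]
\[
  \mathrm{AUC}(s) \;=\; \Pr_{x\sim\mathcal{D}_{\mathrm{ID}},\;x'\sim\mathcal{D}_{\mathrm{OOD}}}\big[s(x) > s(x')\big].
\]
```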
The computational limits of modern Hopfield models are characterized by a norm-based phase transition, where efficient sub-quadratic variants exist only when the norms of input query and memory patterns are below a certain threshold. An efficient nearly linear-time modern Hopfield model is provided as an example, maintaining exponential memory capacity.
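As a point of reference, a minimal sketch of the standard modern Hopfield retrieval update whose sub-quadratic approximability the result ties to pattern norms (Ramsauer-style update; an assumed baseline, not the paper's efficient algorithm):

```python
# Minimal sketch (assumed baseline, not the paper's nearly linear-time model):
# the softmax retrieval update z <- M^T softmax(beta * M z) over stored patterns.
# Whether this attention-like step admits efficient sub-quadratic approximation is
# what the norm-based phase transition characterizes.
import numpy as np

def hopfield_retrieve(query, memories, beta=1.0, steps=1):
    z = query.astype(float)
    for _ in range(steps):
        s = beta * (memories @ z)
        attn = np.exp(s - s.max())       # softmax over stored patterns (numerically stable)
        attn /= attn.sum()
        z = memories.T @ attn            # retrieved pattern as an attention-weighted average
    return z

rng = np.random.default_rng(0)
M = rng.standard_normal((1000, 64))          # 1000 stored memory patterns
q = M[0] + 0.1 * rng.standard_normal(64)     # noisy cue near the first pattern
print(np.allclose(hopfield_retrieve(q, M, beta=8.0), M[0], atol=0.5))  # retrieval succeeds
```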
This paper presents a novel high-probability PAC-Bayes bound whose complexity measure is strictly tighter than the standard Kullback-Leibler (KL) divergence. The new bound is based on a divergence measure called the Zhang-Cutkosky-Paschalidis (ZCP) divergence, which is shown to be orderwise better than the KL divergence in certain cases.
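For context, the classical high-probability PAC-Bayes bound whose KL complexity term the new result improves on (a standard Maurer-style form, not taken from the paper):

```latex
% Classical PAC-Bayes-KL bound (standard form, assumed for context): with probability
% at least 1 - \delta over an i.i.d. sample of size n, simultaneously for all posteriors Q,
\[
  \mathbb{E}_{h\sim Q}\big[L(h)\big] \;\le\; \mathbb{E}_{h\sim Q}\big[\widehat{L}_n(h)\big]
  \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}} .
\]
% The proposed bound replaces KL(Q || P) with the ZCP divergence, which can be orderwise smaller.
```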
There exist average-case computational separations between multimodal and unimodal machine learning tasks, where multimodal learning is feasible in polynomial time but the corresponding unimodal task is computationally hard. However, any such separation implies the existence of cryptographic key agreement protocols, suggesting that very strong computational advantages of multimodal learning may arise infrequently in practice.
The dichotomy of early and late phase implicit biases induced by large initialization and small weight decay can provably lead to a sharp transition from memorization to generalization, a phenomenon known as "grokking", in the training of homogeneous neural networks.
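A minimal training-setup sketch of the regime being described, with large initialization scale and small weight decay (the optimizer, scale, and toy data below are illustrative assumptions, not the paper's experiments):

```python
# Minimal sketch (illustrative assumptions, not the authors' setup): a homogeneous MLP
# (no biases) trained with a large initialization scale and a small weight decay, the
# regime the paper links to a sharp memorization-to-generalization ("grokking") transition.
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha = 8.0          # large initialization scale (assumption)
weight_decay = 1e-4  # small weight decay (assumption)

model = nn.Sequential(nn.Linear(32, 256, bias=False), nn.ReLU(),
                      nn.Linear(256, 2, bias=False))   # homogeneous network
with torch.no_grad():
    for p in model.parameters():
        p.mul_(alpha)                                   # blow up the initialization

opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=weight_decay)
X, y = torch.randn(512, 32), torch.randint(0, 2, (512,))  # toy placeholder data

for step in range(10_000):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
    # Early phase: near-zero training loss with poor generalization (memorization);
    # much later, the small weight decay drives the transition to generalization.
```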
The authors present new theoretical and algorithmic results for multi-class classification with abstention in the predictor-rejector framework, including new surrogate losses with strong consistency guarantees.
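For context, the standard predictor-rejector abstention loss that consistent surrogates target (a standard definition assumed here, not the paper's new surrogates), with predictor h, rejector r, and abstention cost c > 0; the pair abstains whenever r(x) <= 0:

```latex
% Standard predictor-rejector abstention loss (assumed for context).
\[
  L_{\mathrm{abst}}\big(h,r,x,y\big) \;=\; \mathbf{1}_{h(x)\neq y}\,\mathbf{1}_{r(x)>0}
  \;+\; c\,\mathbf{1}_{r(x)\le 0}.
\]
```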
The attention mechanism can be derived from a latent variable model induced by the exchangeability of input tokens, which enables a rigorous characterization of the representation, inference, and learning aspects of attention.
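The object being reinterpreted is standard softmax attention (standard form, not the paper's notation), with query q, keys K, values V, and key dimension d_k:

```latex
% Standard softmax attention (assumed form for context).
\[
  \mathrm{attn}(q,K,V) \;=\; \mathrm{softmax}\!\left(\frac{qK^{\top}}{\sqrt{d_k}}\right)V ,
\]
% which the latent-variable view reads as a posterior-weighted average of values, with the
% softmax weights acting as posterior probabilities over which token is relevant.
```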
This survey explores generalization bounds for learning from graph-dependent data, where the dependencies among examples are described by a dependency graph. It presents concentration inequalities and uses them to derive Rademacher complexity and algorithmic stability generalization bounds for learning from such interdependent data.
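A representative concentration inequality of the kind such surveys build on is Janson's Hoeffding-type bound for graph-dependent variables (standard form, assumed here), where X_1, ..., X_n have dependency graph G, X_i lies in [a_i, b_i], and chi_f(G) is the fractional chromatic number of G:

```latex
% Janson-type Hoeffding inequality for graph-dependent variables (standard form, assumed).
\[
  \Pr\!\left[\sum_{i=1}^{n} X_i - \mathbb{E}\!\sum_{i=1}^{n} X_i \ge t\right]
  \;\le\; \exp\!\left(\frac{-2t^2}{\chi_f(G)\sum_{i=1}^{n}(b_i-a_i)^2}\right).
\]
```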