Core Concepts
Neural networks exhibit permutation symmetry: reordering the neurons within a hidden layer (and permuting the adjacent weight matrices to match) does not change the function the network computes. This symmetry contributes to the non-convexity of the networks' loss landscapes. Recent work has argued that permutation symmetries are the only source of non-convexity, meaning there are essentially no loss barriers between trained networks once they are appropriately permuted. This work refines these arguments into three distinct claims of increasing strength, provides empirical evidence for the intermediate claim, and offers initial evidence towards the strongest one.
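As a minimal sketch of this symmetry, consider a hypothetical two-layer MLP in NumPy: permuting the hidden units, together with the rows of the first weight matrix and the columns of the second, leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical two-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Permute the hidden units: reorder the rows of (W1, b1) and the columns of W2
# with the same permutation.
perm = rng.permutation(d_hidden)
W1_p, b1_p, W2_p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=d_in)
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_p, b1_p, W2_p, b2))
print("The permuted network computes the same function.")
```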
Abstract
The paper investigates different notions of linear connectivity of neural networks modulo permutation. It makes the following key observations:
Existing evidence only supports "weak linear connectivity" - that for each pair of trained networks, there exists a permutation of one of them such that the linear path between the two has a low loss barrier.
The stronger claim of "strong linear connectivity" - that for each network, there exists a single permutation that simultaneously connects it with all other networks - is both intuitively and practically more desirable, as it would imply an essentially convex loss landscape once permutation symmetry is accounted for.
The paper introduces an intermediate claim of "simultaneous weak linear connectivity" - that for certain sequences of networks (for example, two training trajectories), a single permutation simultaneously aligns every matching pair of networks across the sequences. (The three notions are formalized in the sketch that follows this list.)
The paper provides empirical evidence for simultaneous weak linear connectivity:
It shows that a single permutation can align two SGD training trajectories, so that the pair of networks at each step of optimization exhibits a low loss barrier (see the barrier-evaluation sketch after this list).
It also demonstrates that the same permutation can align sequences of iteratively pruned networks.
Furthermore, the paper provides the first evidence towards strong linear connectivity by showing that loss barriers decrease with increasing network width when interpolating among three networks simultaneously.
The paper also discusses limitations of the weight matching and activation matching algorithms used to align networks, and how these relate to network stability and to the emergence of features during training (a minimal weight-matching sketch appears below).
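To make the three connectivity claims above concrete, here is one common way to formalize them; the loss-barrier definition follows the linear mode connectivity literature, and the statements are a paraphrase rather than the paper's verbatim definitions.

```latex
% Loss barrier between parameter vectors \theta_A and \theta_B:
B(\theta_A, \theta_B) = \max_{\lambda \in [0,1]}
  \Big[ \mathcal{L}\big(\lambda \theta_A + (1-\lambda)\theta_B\big)
        - \big(\lambda \mathcal{L}(\theta_A) + (1-\lambda)\mathcal{L}(\theta_B)\big) \Big]

% Weak linear connectivity: for every pair of trained networks (\theta_A, \theta_B),
% there exists a permutation \pi with  B(\theta_A, \pi(\theta_B)) \approx 0.

% Simultaneous weak linear connectivity: for sequences (\theta_A^t)_t and (\theta_B^t)_t
% (e.g. two training trajectories), a single \pi satisfies
% B(\theta_A^t, \pi(\theta_B^t)) \approx 0 for all t.

% Strong linear connectivity: for each network there is one permutation that yields
% a near-zero barrier to all other (permuted) trained networks simultaneously.
```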
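A minimal sketch of the weight-matching idea for a single hidden layer, assuming NumPy/SciPy and the (W1, b1, W2, b2) layout from the earlier snippet: treat alignment as a linear assignment problem over hidden units and solve it with the Hungarian algorithm. Full weight-matching procedures alternate such a step over all layers; the function name and layout here are illustrative, not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_units(W1_a, b1_a, W2_a, W1_b, b1_b, W2_b):
    """Find the permutation of network B's hidden units that best matches network A.

    Similarity between unit i of A and unit j of B is the inner product of their
    incoming and outgoing weight vectors; the best one-to-one assignment is found
    with the Hungarian algorithm (linear_sum_assignment).
    """
    similarity = W1_a @ W1_b.T + np.outer(b1_a, b1_b) + W2_a.T @ W2_b
    _, perm = linear_sum_assignment(similarity, maximize=True)
    return perm  # apply as W1_b[perm], b1_b[perm], W2_b[:, perm]
```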
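Finally, a sketch of how the barrier measurements along aligned training trajectories could be carried out; `evaluate_loss` is a hypothetical user-supplied function, and checkpoints are assumed to be stored as the (W1, b1, W2, b2) tuples from the snippets above.

```python
import numpy as np

def interpolate(params_a, params_b, lam):
    """Pointwise linear interpolation (1 - lam) * A + lam * B in parameter space."""
    return [(1.0 - lam) * pa + lam * pb for pa, pb in zip(params_a, params_b)]

def loss_barrier(params_a, params_b, evaluate_loss, data, n_points=11):
    """Maximum height of the loss on the linear path above the endpoints' average."""
    lams = np.linspace(0.0, 1.0, n_points)
    path = [evaluate_loss(interpolate(params_a, params_b, lam), data) for lam in lams]
    baseline = [(1.0 - lam) * path[0] + lam * path[-1] for lam in lams]
    return max(p - b for p, b in zip(path, baseline))

def permute_hidden_units(params, perm):
    """Apply one hidden-unit permutation to a (W1, b1, W2, b2) network."""
    W1, b1, W2, b2 = params
    return [W1[perm], b1[perm], W2[:, perm], b2]

def trajectory_barriers(traj_a, traj_b, perm, evaluate_loss, data):
    """Barrier at each training step t, reusing one permutation for every step."""
    return [loss_barrier(ckpt_a, permute_hidden_units(ckpt_b, perm), evaluate_loss, data)
            for ckpt_a, ckpt_b in zip(traj_a, traj_b)]
```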