Investigating Model Merging in Text Transformers
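The author explores the connectivity between separately trained Transformer models using permutation-based merging, which aligns the units of one model with those of another before their weights are combined. The results reveal lower loss barriers between the models' minima and suggest that these minima are less isolated than they first appear. This work sheds light on the geometric properties of Transformer minima and their implications for optimization techniques.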
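
To make the two central ideas concrete (permutation alignment and the loss barrier), here is a minimal sketch on a toy two-layer feed-forward block rather than a full Transformer. Everything in it is an illustrative assumption and not the author's implementation: the helper names (`permute_hidden`, `align`, `barrier`), the brute-force search over permutations, the weight-similarity score, and the choice to build the second model as a shuffled copy of the first in place of a genuinely separately trained model.

```python
import itertools
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(model, x):
    # Toy feed-forward block: y = W2 . relu(W1 . x)
    W1, W2 = model
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)

def permute_hidden(model, perm):
    # Reorder hidden units: rows of W1 and the matching columns of W2.
    # The function computed by the block is unchanged.
    W1, W2 = model
    return [W1[p] for p in perm], [[row[p] for p in perm] for row in W2]

def align(model_a, model_b):
    # Hypothetical weight-matching step: pick the hidden-unit permutation of
    # model_b whose weights are most similar to model_a (brute force, which is
    # only feasible for tiny widths; a linear assignment solver scales better).
    W1a, W2a = model_a
    W1b, W2b = model_b

    def score(perm):
        total = 0.0
        for i, p in enumerate(perm):
            total += sum(a * b for a, b in zip(W1a[i], W1b[p]))
            total += sum(ra[i] * rb[p] for ra, rb in zip(W2a, W2b))
        return total

    best = max(itertools.permutations(range(len(W1a))), key=score)
    return permute_hidden(model_b, best)

def interpolate(model_a, model_b, t):
    # Linear interpolation of every weight: (1 - t) * A + t * B
    return tuple(
        [[(1 - t) * a + t * b for a, b in zip(ra, rb)] for ra, rb in zip(Ma, Mb)]
        for Ma, Mb in zip(model_a, model_b)
    )

def loss(model, data):
    # Mean squared error over the dataset
    total = 0.0
    for x, y in data:
        total += sum((p - yi) ** 2 for p, yi in zip(forward(model, x), y))
    return total / len(data)

def barrier(model_a, model_b, data, steps=11):
    # Largest excess of the loss along the path over the interpolated endpoint losses
    la, lb = loss(model_a, data), loss(model_b, data)
    ts = [i / (steps - 1) for i in range(steps)]
    return max(
        loss(interpolate(model_a, model_b, t), data) - ((1 - t) * la + t * lb)
        for t in ts
    )

rng = random.Random(0)
d, h, o = 3, 4, 2  # input, hidden, and output sizes (arbitrary toy values)
model_a = (
    [[rng.gauss(0, 1) for _ in range(d)] for _ in range(h)],
    [[rng.gauss(0, 1) for _ in range(h)] for _ in range(o)],
)
# Stand-in for a second model: same function, hidden units shuffled.
model_b = permute_hidden(model_a, [2, 0, 3, 1])

inputs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(32)]
data = [(x, forward(model_a, x)) for x in inputs]

naive_barrier = barrier(model_a, model_b, data)
aligned_barrier = barrier(model_a, align(model_a, model_b), data)
print("barrier without alignment:", naive_barrier)
print("barrier after alignment:  ", aligned_barrier)
```

In this constructed case the two models compute the same function, so the barrier after alignment should be essentially zero, while interpolating the unaligned weights passes through a region of higher loss. With genuinely separately trained models the alignment is only approximate, and a Transformer also requires handling attention heads and the residual stream, which this sketch does not attempt.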