Efficient Initialization of Variable-sized Transformer Models via Stage-wise Weight Sharing
Stage-wise Weight Sharing (SWS) is a simple but effective Learngene approach that efficiently initializes variable-sized Transformer models by integrating stage information and expansion guidance into the learned learngenes.
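
As a rough illustration of what initializing variable-sized models from stage-wise shared weights could look like, the sketch below assumes that each stage contributes one shared layer and that expansion means repeating that layer within its stage. The function name `expand_learngene`, the dict-based layer layout, and the toy weights are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from copy import deepcopy


def expand_learngene(stage_layers, layers_per_stage):
    """Build a layer stack by repeating each stage's shared layer.

    stage_layers: one parameter dict per stage (a hypothetical "learngene").
    layers_per_stage: how many layers the target model has in each stage.
    """
    if len(stage_layers) != len(layers_per_stage):
        raise ValueError("need exactly one repeat count per stage")

    model = []
    for stage_idx, (layer, repeats) in enumerate(zip(stage_layers, layers_per_stage)):
        for _ in range(repeats):
            # Every copy starts from the same stage weights but is an
            # independent object, so layers can diverge during fine-tuning.
            model.append({"stage": stage_idx, "params": deepcopy(layer)})
    return model


# Toy learngene with three stages (assumed values, illustration only).
learngene = [{"w": [0.1, 0.2]}, {"w": [0.3, 0.4]}, {"w": [0.5, 0.6]}]

small = expand_learngene(learngene, [1, 1, 1])  # 3-layer descendant
large = expand_learngene(learngene, [2, 3, 1])  # 6-layer descendant
```

Under these assumptions, the same learngene initializes both a 3-layer and a 6-layer model. The stage index stored with each layer stands in for the stage information, and the per-stage repeat counts stand in for the expansion guidance.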