The paper introduces JoMA, a framework that jointly analyzes the training dynamics of self-attention and MLP layers in transformers. It explains how attention first becomes sparse, concentrating on the most salient tokens, and later becomes dense again as training progresses. The theoretical findings are validated with experiments on real-world datasets and pre-trained models. The study also connects the expressiveness of attention-based models to their training dynamics, and an analysis of a hierarchical generative data model offers insight into how such models learn hierarchical data distributions.
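One way to probe the "sparse then dense" claim empirically is to track the entropy of each attention head's distribution over training checkpoints: entropy drops as attention concentrates on salient tokens and can rise again later. The sketch below is illustrative only; the helper name and the toy tensors are not from the paper.

```python
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean entropy (in nats) of attention distributions.

    attn: tensor whose last dimension sums to 1 (e.g. the softmax
    output of one head). Lower entropy means sparser attention.
    """
    eps = 1e-12  # avoid log(0)
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()

# Toy comparison: a peaked (sparse) attention row vs. a near-uniform (dense) one.
seq_len = 16
sparse_row = torch.softmax(torch.tensor([8.0] + [0.0] * (seq_len - 1)), dim=-1)
dense_row = torch.softmax(torch.zeros(seq_len), dim=-1)

print(attention_entropy(sparse_row))  # near 0 -> sparse
print(attention_entropy(dense_row))   # near log(16) ~= 2.77 -> dense
```

Applying this measure to saved checkpoints of a training run would trace the sparsity trajectory the paper describes; the toy rows above only show the two extremes of the scale.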
Key insights distilled from: Yuandong Tia... at arxiv.org, 03-18-2024, https://arxiv.org/pdf/2310.00535.pdf