The paper introduces a novel attention mechanism called Möbius Attention that leverages Möbius transformations to improve the expressivity of Transformer-based models. Möbius transformations are non-linear maps over the complex plane that can send lines to circles and circles to lines, warping geometric structure in ways a purely linear projection cannot. By incorporating these transformations into the attention mechanism, the model can learn more intricate relationships between tokens and capture a wider range of linguistic patterns.
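In general, a Möbius transformation has the form f(z) = (az + b)/(cz + d) with complex coefficients satisfying ad - bc ≠ 0. The sketch below is a rough illustration only, not the paper's exact formulation: it pairs query features into complex numbers and applies such a transformation element-wise, with illustrative tensor shapes and near-identity coefficients chosen as assumptions.

```python
import torch

def mobius_transform(z, a, b, c, d):
    # Möbius transformation f(z) = (a*z + b) / (c*z + d), applied element-wise
    # to complex-valued inputs; requires a*d - b*c != 0 to be invertible.
    return (a * z + b) / (c * z + d)

# Toy example: treat each pair of real query features as one complex number
# and warp it with learned complex coefficients (shapes are hypothetical).
torch.manual_seed(0)
queries = torch.randn(2, 4, 8)                              # (batch, seq_len, dim)
z = torch.complex(queries[..., 0::2], queries[..., 1::2])   # (batch, seq_len, dim/2)

# One coefficient set per complex feature, initialized near the identity map.
a = torch.complex(torch.ones(4), torch.zeros(4))
b = torch.complex(torch.zeros(4), torch.zeros(4)) + 0.1
c = torch.complex(torch.zeros(4), torch.zeros(4)) + 0.05
d = torch.complex(torch.ones(4), torch.zeros(4))

z_warped = mobius_transform(z, a, b, c, d)
print(z_warped.shape)  # torch.Size([2, 4, 4])
```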
The authors integrate Möbius Attention into the BERT and RoFormer architectures, creating MöbiusBERT and MobRoFormer models. These enhanced models are pre-trained on the Colossal Clean Crawled Corpus (C4) dataset and then fine-tuned on the GLUE benchmark. The results show that the Möbius Attention models outperform their baseline counterparts across various GLUE tasks, including MNLI, QQP, QNLI, SST-2, and RTE, while using fewer parameters.
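For context, a minimal data-loading sketch using the Hugging Face datasets library is given below; the dataset identifiers are the publicly available C4 and GLUE releases, while the paper's actual preprocessing and training hyperparameters are omitted and not reproduced.

```python
from datasets import load_dataset

# Illustrative only: load the public C4 corpus (streamed) and one GLUE task.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
mnli = load_dataset("glue", "mnli", split="train")

print(next(iter(c4))["text"][:80])                      # a raw pre-training document
print(mnli[0]["premise"], "->", mnli[0]["hypothesis"])  # a fine-tuning example
```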
The authors provide a detailed analysis of the learned Möbius weights, revealing that the models capture a diverse range of complex geometries and exhibit both layer-level and head-level specialization. Additionally, Möbius Attention is shown to learn what to "forget" rather than what to "focus on," in contrast to traditional attention, which emphasizes where to attend.
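To make the notion of "diverse complex geometries" concrete, the sketch below uses the standard trace-based classification of Möbius transformations (elliptic, parabolic, hyperbolic, loxodromic). This is textbook material offered purely as an illustration, not the analysis procedure used in the paper.

```python
import cmath

def classify_mobius(a, b, c, d, tol=1e-9):
    # Classify f(z) = (a*z + b) / (c*z + d) by the squared trace of its
    # normalized matrix [[a, b], [c, d]] / sqrt(ad - bc).
    det = a * d - b * c
    if abs(det) < tol:
        raise ValueError("degenerate: ad - bc must be non-zero")
    tr2 = ((a + d) / cmath.sqrt(det)) ** 2
    if abs(tr2.imag) > tol:
        return "loxodromic"
    t = tr2.real
    if abs(t - 4) < tol:
        return "parabolic"
    if 0 <= t < 4:
        return "elliptic"
    return "hyperbolic" if t > 4 else "loxodromic"

print(classify_mobius(1, 1, 0, 1))    # translation z -> z + 1: parabolic
print(classify_mobius(0, 1, -1, 0))   # z -> -1/z (a rotation): elliptic
print(classify_mobius(2, 0, 0, 0.5))  # z -> 4z (a pure scaling): hyperbolic
```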
The paper also includes an ablation study that explores different architectural configurations for integrating Möbius Attention, finding that the "framed" architecture, with Möbius Attention layers at the beginning and end of the Transformer stack, achieves the best performance.
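As an architectural sketch of the "framed" idea, the snippet below builds a stack whose first and last layers use a hypothetical MobiusAttentionBlock (stubbed here with a standard encoder layer so the code runs) while the middle layers are vanilla; the layer count and dimensions are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class MobiusAttentionBlock(nn.Module):
    # Placeholder for a Möbius-attention block; it simply wraps a standard
    # encoder layer so that the stack-building code below is runnable.
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x):
        return self.inner(x)

def build_framed_encoder(num_layers=12, d_model=768, n_heads=12):
    # "Framed" configuration: Möbius blocks at the start and end,
    # vanilla Transformer encoder layers in between.
    layers = [MobiusAttentionBlock(d_model, n_heads)]
    layers += [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
               for _ in range(num_layers - 2)]
    layers.append(MobiusAttentionBlock(d_model, n_heads))
    return nn.Sequential(*layers)

encoder = build_framed_encoder()
x = torch.randn(2, 16, 768)   # (batch, seq_len, d_model)
print(encoder(x).shape)       # torch.Size([2, 16, 768])
```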
Source: Anna-Maria H..., arxiv.org, 09-19-2024, https://arxiv.org/pdf/2409.12175.pdf