BurstAttention is a distributed attention framework for handling extremely long sequences by partitioning the attention computation across devices. It combines a global attention optimization (GAO) strategy with a local attention optimization (LAO) strategy to reduce memory access and communication costs. Compared with competitive baselines, BurstAttention reduces communication overhead and improves data throughput when processing long sequences. The framework uses double-buffering to overlap communication with computation, and it also integrates well with sparse attention methods for further optimization.
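To make the global accumulation idea concrete, below is a minimal single-process sketch in Python/NumPy. It simulates a ring-style pass in which each "device" holds one query shard while key/value shards circulate, and partial results are merged with online softmax rescaling so no device ever materializes the full attention matrix. The function name, the simulation setup, and all shapes are illustrative assumptions, not the paper's actual distributed/CUDA implementation.

```python
import numpy as np

def ring_attention_sketch(q_shards, k_shards, v_shards, scale):
    """Simulate GAO-style global attention accumulation.

    Each simulated device i owns one query shard; key/value shards arrive
    one per ring step.  Partial results are merged with an online softmax
    (running max + rescaled running sum), so only block-sized score
    matrices are ever formed.
    """
    world = len(q_shards)
    outputs = []
    for i in range(world):
        q = q_shards[i]                        # (block, d)
        acc = np.zeros_like(q)                 # running weighted sum of V
        row_max = np.full(q.shape[0], -np.inf) # running row-wise max
        row_sum = np.zeros(q.shape[0])         # running softmax denominator
        for step in range(world):
            j = (i + step) % world             # shard arriving this ring step
            k, v = k_shards[j], v_shards[j]
            s = (q @ k.T) * scale              # local scores (block, block)
            new_max = np.maximum(row_max, s.max(axis=-1))
            p = np.exp(s - new_max[:, None])   # local, unnormalized softmax
            correction = np.exp(row_max - new_max)
            acc = acc * correction[:, None] + p @ v
            row_sum = row_sum * correction + p.sum(axis=-1)
            row_max = new_max
        outputs.append(acc / row_sum[:, None])
    return np.concatenate(outputs, axis=0)

# Tiny usage example: 4 simulated devices, sequence length 32, head dim 8.
rng = np.random.default_rng(0)
seq, d, world = 32, 8, 4
q, k, v = (rng.normal(size=(seq, d)) for _ in range(3))
q_shards, k_shards, v_shards = (np.split(x, world) for x in (q, k, v))
out = ring_attention_sketch(q_shards, k_shards, v_shards, scale=1.0 / np.sqrt(d))

# Reference: standard full attention computed on a single device.
scores = (q @ k.T) / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(out, ref, atol=1e-6)
```

In the real framework the inner loop would be driven by device-to-device communication rather than indexing into local lists, and double-buffering would prefetch the next key/value shard while the current block is being computed; the sketch only captures the numerically exact merging of partial attention results.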
Key insights from the paper by Sun Ao, Weili... (arxiv.org, 2024-03-15): https://arxiv.org/pdf/2403.09347.pdf