An efficient, low-latency attention module proposed for streaming self-supervised speech representation learning.
SimA is a softmax-free attention block for vision transformers that simplifies computation while achieving results on par with SOTA models.
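To make the softmax-free idea concrete, here is a minimal sketch in the spirit of SimA, which replaces softmax with ℓ1 normalization of the query and key matrices along the token axis; the tensor shapes, the normalization epsilon, and the cost-based choice of multiplication order are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def sima_style_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, tokens, dim)
    # l1-normalize queries and keys along the token axis (assumed axis),
    # then use plain matrix products; no softmax anywhere
    q = q / (q.abs().sum(dim=1, keepdim=True) + eps)
    k = k / (k.abs().sum(dim=1, keepdim=True) + eps)
    n, d = q.shape[1], q.shape[2]
    # associativity lets us pick the cheaper order:
    # (q @ k^T) @ v costs O(n^2 d), while q @ (k^T @ v) costs O(n d^2)
    if n > d:
        return q @ (k.transpose(1, 2) @ v)
    return (q @ k.transpose(1, 2)) @ v
```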
A simple modification to the conventional attention mechanism allows it to be linearized as a composition of log-sums of exponentials over a fixed-size latent space, so it can be applied sequentially at constant cost per token.
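One way to read this: for each of L latent slots, maintain a running log-sum-exp normalizer and a normalized value summary, fold each incoming token into both, and mix the per-latent summaries to produce the output, so each step costs O(L·d) regardless of sequence length. The sketch below illustrates that recurrence under assumed names and shapes (q_logits and k_logits as per-token scores against the latents, softmax mixing over latents); it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def streaming_latent_attention(q_logits, k_logits, v):
    # q_logits, k_logits: (T, L) scores of each token against L latents
    # v: (T, d) values; all names and shapes are illustrative assumptions
    T, L = k_logits.shape
    d = v.shape[1]
    log_z = torch.full((L,), float("-inf"))  # running log-sum-exp per latent
    num = torch.zeros(L, d)                  # running normalized value sums
    outs = []
    for t in range(T):
        # fold token t into each latent's log-sum-exp normalizer
        new_log_z = torch.logaddexp(log_z, k_logits[t])
        # rescale the running numerator to the new normalizer, add token t
        num = num * torch.exp(log_z - new_log_z).unsqueeze(1) \
              + torch.exp(k_logits[t] - new_log_z).unsqueeze(1) * v[t]
        log_z = new_log_z
        # output: a softmax over latents mixes the per-latent summaries
        outs.append(F.softmax(q_logits[t], dim=0) @ num)
    return torch.stack(outs)  # (T, d), constant work per token
```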
SageAttention is a quantization method that significantly accelerates transformer inference by applying 8-bit quantization to the attention computation, outpacing FlashAttention2 and xformers in speed while keeping end-to-end accuracy essentially intact.
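A rough sketch of the idea, with simplifications loudly flagged: the real method quantizes per block inside a fused GPU kernel, whereas this toy version uses a single per-tensor INT8 scale, emulates the INT8 matmul in int32 on CPU, and keeps the P·V product in full precision. The mean-subtraction trick on K is safe because adding a constant to every score in a row leaves the softmax unchanged.

```python
import torch

def int8_quant(x):
    # symmetric INT8 quantization with a single per-tensor scale
    # (the actual method uses finer, per-block scales)
    scale = (x.abs().max() / 127.0).clamp(min=1e-8)
    return (x / scale).round().clamp(-127, 127).to(torch.int8), scale

def sage_style_attention(q, k, v):
    # q, k, v: (tokens, dim)
    # smooth K by removing its per-channel mean; softmax is invariant to
    # a per-row constant in the scores, so this costs no accuracy
    k = k - k.mean(dim=0, keepdim=True)
    q_i8, q_s = int8_quant(q.float())
    k_i8, k_s = int8_quant(k.float())
    # INT8 matmul, emulated in int32 here; dequantize the scores after
    scores = (q_i8.to(torch.int32) @ k_i8.to(torch.int32).T).float()
    scores = scores * (q_s * k_s) / q.shape[1] ** 0.5
    p = torch.softmax(scores, dim=-1)
    # keep P @ V in higher precision, where quantization would hurt most
    return p @ v.float()
```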