Core Concepts
Numeric deviation introduced by the Flash Attention optimization can affect model training stability, but its magnitude is smaller than the weight changes introduced by other common training techniques, such as low-precision training.
Abstract
The authors investigate the potential numeric deviation caused by the Flash Attention optimization, a widely-adopted technique used to speed up the attention mechanism in transformer models. They develop a microbenchmark to isolate and quantify the numeric deviation between Flash Attention and a baseline attention implementation, and find that Flash Attention exhibits roughly an order of magnitude more numeric deviation than Baseline Attention at low numerical precision (BF16).
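To make the comparison concrete, below is a minimal sketch of such a microbenchmark (not the authors' code): a naive attention and PyTorch's fused scaled_dot_product_attention, which may dispatch to a Flash-Attention-style kernel on supported GPUs, are both evaluated at BF16 and measured against a float64 reference.

```python
# Sketch of a numeric-deviation microbenchmark (assumptions: PyTorch >= 2.0;
# whether the fused call uses a Flash-Attention kernel depends on hardware/backend).
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Baseline: materialize the full score matrix, then softmax and matmul.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
q, k, v = (torch.randn(1, 8, 512, 64, device=device) for _ in range(3))

# Golden reference computed in float64.
ref = naive_attention(q.double(), k.double(), v.double())

# Both implementations evaluated at BF16.
qb, kb, vb = q.bfloat16(), k.bfloat16(), v.bfloat16()
baseline = naive_attention(qb, kb, vb)
fused = F.scaled_dot_product_attention(qb, kb, vb)

print("baseline max |dev|:", (baseline.double() - ref).abs().max().item())
print("fused    max |dev|:", (fused.double() - ref).abs().max().item())
```

On hardware where the fused call dispatches to a Flash-Attention kernel, the two printed deviations can be compared directly; on other backends the fused path may fall back to a standard math implementation, in which case the comparison is not meaningful.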
To contextualize the significance of this numeric deviation, the authors perform a data-driven analysis using the Wasserstein Distance metric to measure the changes in model weights throughout training. They find that the numeric deviation introduced by Flash Attention is 2-5 times less significant than the weight changes caused by low-precision training, a commonly used technique.
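As an illustration of how such a weight-level comparison might look, here is a sketch (not the authors' procedure) using SciPy's 1-D Wasserstein distance on flattened weight tensors from two checkpoints of the same architecture; the checkpoint file names are hypothetical.

```python
# Sketch: per-layer Wasserstein distance between two weight snapshots,
# e.g. a Flash Attention run vs. a baseline run at the same training step.
import torch
from scipy.stats import wasserstein_distance

def weight_drift(state_dict_a, state_dict_b):
    drifts = {}
    for name, w_a in state_dict_a.items():
        w_b = state_dict_b[name]
        if not w_a.is_floating_point():
            continue  # skip integer buffers such as step counters
        drifts[name] = wasserstein_distance(
            w_a.float().flatten().cpu().numpy(),
            w_b.float().flatten().cpu().numpy(),
        )
    return drifts

# Usage sketch (hypothetical checkpoint paths):
# drift = weight_drift(torch.load("flash_ckpt.pt"), torch.load("baseline_ckpt.pt"))
# print(sorted(drift.items(), key=lambda kv: -kv[1])[:5])
```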
The authors also explore how various algorithm changes to Flash Attention, such as block size and dimension order, impact the observed numeric deviation. Larger block sizes are found to reduce the numeric deviation, as they require fewer rescaling calculations.
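To show where these rescaling calculations come from, the following is a simplified, single-query-row sketch of the online-softmax accumulation used by Flash-Attention-style algorithms (illustrative only, not the authors' kernel). Each score block that raises the running maximum forces a rescale of the accumulated partial results, so larger blocks mean fewer blocks per row and fewer rescaling steps.

```python
# Sketch of block-wise (online) softmax with rescaling for one query row.
import torch

def blockwise_softmax_weighted_sum(scores, values, block_size):
    # scores: (seq_len,), values: (seq_len, d) for a single query position.
    running_max = torch.tensor(float("-inf"))
    denom = torch.tensor(0.0)
    acc = torch.zeros(values.shape[-1])
    for start in range(0, scores.shape[0], block_size):
        s = scores[start:start + block_size]
        v = values[start:start + block_size]
        new_max = torch.maximum(running_max, s.max())
        scale = torch.exp(running_max - new_max)  # rescale previously accumulated partials
        p = torch.exp(s - new_max)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v
        running_max = new_max
    return acc / denom
```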
Overall, the authors develop a principled framework to quantify numeric deviation in training optimizations and provide insights into the potential impact of the Flash Attention optimization on model training stability.
Stats
Flash Attention sees roughly an order of magnitude more numeric deviation as compared to Baseline Attention at BF16.
The numeric deviation introduced by Flash Attention is 2-5 times less significant than the weight changes caused by low-precision training.
Quotes
"Flash Attention sees roughly an order of magnitude more numeric deviation as compared to Baseline Attention at BF16."
"The numeric deviation introduced by Flash Attention is 2-5 times less significant than the weight changes caused by low-precision training."