This comprehensive survey examines the interplay between algorithms and hardware in optimizing Vision Transformer (ViT) inference. It first delves into the unique architectural attributes and runtime characteristics of ViTs, highlighting their computational bottlenecks.
The survey then explores the fundamental principles of model quantization, including linear quantization, symmetric/asymmetric quantization, and static/dynamic quantization, illustrated in the sketch below. It provides a comparative analysis of state-of-the-art quantization techniques for ViTs, focusing on the challenges of quantizing non-linear operations such as softmax, layer normalization, and GELU.
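To make the terminology concrete, here is a minimal NumPy sketch of linear (uniform) quantization in its symmetric and asymmetric variants, following the standard mapping x ≈ s·(q − z) with scale s and zero-point z. The function names and the toy tensor are illustrative choices, not code from the survey:

```python
import numpy as np

def linear_quantize(x, num_bits=8, symmetric=True):
    """Linear quantization sketch (illustrative, not from the survey).

    Symmetric: zero-point fixed at 0, integer range [-(2^(b-1)-1), 2^(b-1)-1].
    Asymmetric: zero-point shifts the range to cover [min(x), max(x)].
    """
    if symmetric:
        qmax = 2 ** (num_bits - 1) - 1
        scale = np.abs(x).max() / qmax
        zero_point = 0
        q = np.clip(np.round(x / scale), -qmax, qmax)
    else:
        qmax = 2 ** num_bits - 1
        x_min, x_max = x.min(), x.max()
        scale = (x_max - x_min) / qmax
        zero_point = np.round(-x_min / scale)
        q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return q.astype(np.int32), scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to real values: x ≈ scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

# Example: quantize a toy activation tensor both ways and check the error.
x = np.random.randn(4, 4).astype(np.float32)
for sym in (True, False):
    q, s, z = linear_quantize(x, num_bits=8, symmetric=sym)
    err = np.abs(x - dequantize(q, s, z)).max()
    print(f"symmetric={sym}: max abs error {err:.4f}")
```

In static quantization the scale and zero-point above would be fixed from calibration data ahead of time, whereas dynamic quantization recomputes them per input at runtime.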
The survey also examines hardware acceleration strategies for quantized ViTs, emphasizing the importance of hardware-friendly algorithm design. It discusses various calibration optimization methods for post-training quantization (PTQ) and gradient-based optimization techniques for quantization-aware training (QAT). It further covers specialized strategies for binary quantization of ViTs, which aims to achieve ultra-compact models with efficient bitwise operations; a sketch of the bitwise arithmetic involved follows.
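The efficiency of binary quantization comes from replacing multiply-accumulate with XNOR/XOR and popcount: for two {-1, +1} vectors of length n, the dot product equals n − 2·popcount(a XOR b) when +1 is encoded as bit 1. The following sketch, with illustrative helper names of our own choosing, checks that identity against a floating-point reference:

```python
import numpy as np

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1} vectors packed as integer bit-masks.

    With +1 encoded as bit 1 and -1 as bit 0, matching bits contribute +1
    and mismatching bits -1, so: dot = n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * mismatches

def pack_signs(x):
    """Binarize a real vector by sign and pack it into an int bit-mask."""
    bits = 0
    for i, v in enumerate(x):
        if v >= 0:
            bits |= 1 << i
    return bits

# Example: compare the bitwise result against the sign-vector reference.
rng = np.random.default_rng(0)
x, w = rng.standard_normal(16), rng.standard_normal(16)
ref = int(np.sign(x) @ np.sign(w))  # dot product on {-1,+1} vectors
print(binary_dot(pack_signs(x), pack_signs(w), 16), ref)  # should match
```

Because a 64-bit XOR plus popcount processes 64 weight-activation pairs per instruction, this substitution is what makes binary ViTs attractive for hardware acceleration.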
To facilitate further research and development in this domain, the authors also maintain a repository of related open-source materials.
Key ideas extracted from the source content by Dayou Du et al., arxiv.org, 05-02-2024: https://arxiv.org/pdf/2405.00314.pdf