This comprehensive survey examines the interplay between algorithms and hardware in optimizing Vision Transformer (ViT) inference. It first delves into the unique architectural attributes and runtime characteristics of ViTs, highlighting their computational bottlenecks.
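To make those bottlenecks concrete, below is a minimal sketch of one standard ViT encoder block (PyTorch; the dimensions and module layout are generic assumptions, not taken from the survey). Each block interleaves integer-friendly matrix multiplications with the non-linear operations, namely softmax inside attention, LayerNorm, and GELU, that the survey singles out as the hard parts to quantize.

```python
# A generic sketch of one ViT encoder block; dimensions are illustrative.
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)          # non-linear op: a quantization pain point
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # softmax lives here
        self.norm2 = nn.LayerNorm(dim)          # non-linear op
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),    # integer-friendly matmul
            nn.GELU(),                          # non-linear op
            nn.Linear(mlp_ratio * dim, dim),    # integer-friendly matmul
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention + residual
        x = x + self.mlp(self.norm2(x))                    # MLP + residual
        return x

tokens = torch.randn(1, 197, 768)    # 196 patches + 1 class token for a 224x224 image
print(ViTBlock()(tokens).shape)      # torch.Size([1, 197, 768])
```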
The survey then explores the fundamental principles of model quantization, including linear quantization, symmetric/asymmetric quantization, and static/dynamic quantization. It provides a comparative analysis of state-of-the-art quantization techniques for ViTs, focusing on the challenges of quantizing non-linear operations such as softmax, layer normalization, and GELU.
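As a concrete illustration of these basics (a generic sketch, not code from the survey), the following NumPy snippet implements uniform linear quantization in both flavors: symmetric quantization pins the zero-point at 0, while asymmetric quantization shifts the integer grid to cover skewed ranges such as post-GELU activations. In static quantization the scale and zero-point would be fixed from calibration data; computing them per input, as done here, corresponds to the dynamic case.

```python
import numpy as np

def quantize(x, n_bits=8, symmetric=True):
    """Uniform (linear) quantization of a float tensor to n-bit integers."""
    if symmetric:
        qmax = 2 ** (n_bits - 1) - 1                 # e.g. 127 for int8
        scale = np.abs(x).max() / qmax
        zero_point = 0
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    else:
        qmin, qmax = 0, 2 ** n_bits - 1              # e.g. [0, 255] for uint8
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = np.round(qmin - x.min() / scale)
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.int32), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32) - 0.5   # skewed, like post-GELU activations
for sym in (True, False):
    q, s, z = quantize(x, symmetric=sym)
    err = np.abs(x - dequantize(q, s, z)).mean()
    print(f"symmetric={sym}: mean abs error {err:.5f}")
```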
The survey also examines hardware acceleration strategies for quantized ViTs, emphasizing the importance of hardware-friendly algorithm design. It discusses calibration optimization methods for post-training quantization (PTQ) and gradient-based optimization techniques for quantization-aware training (QAT), and covers specialized strategies for binary quantization of ViTs, which aim to produce ultra-compact models that run on efficient bitwise operations.
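To ground the QAT discussion, here is a minimal fake-quantization sketch using the straight-through estimator (STE), the standard device for passing gradients through the non-differentiable rounding step; it is a generic illustration under assumed symmetric int8 settings, not an implementation of any specific method surveyed.

```python
import torch

def fake_quant_ste(x, n_bits=8):
    """Simulate symmetric quantization in the forward pass; pass gradients
    straight through the rounding step (detach trick) so QAT can train."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (q - x).detach()   # forward: quantized value; backward: identity (STE)

w = torch.randn(4, 4, requires_grad=True)
loss = fake_quant_ste(w).pow(2).sum()
loss.backward()
print(w.grad is not None)  # True: gradients flow despite the rounding
```

The binary case follows the same pattern with a sign function in place of rounding, which is what lets inference kernels fall back to XNOR and popcount operations.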
The authors also maintain a repository of related open-source materials to facilitate further research and development in this domain.
Key insights extracted from: Dayou Du, Gu ... (arxiv.org, 05-02-2024)
https://arxiv.org/pdf/2405.00314.pdf