Core Concepts
Vision Transformers (ViTs) have emerged as a promising alternative to convolutional neural networks (CNNs) in computer vision, but their large model sizes and high computational demands hinder deployment, especially on resource-constrained devices. Model quantization and hardware acceleration are crucial to address these challenges and enable efficient ViT inference.
Summary
This comprehensive survey examines the interplay between algorithms and hardware in optimizing ViT inference. It first delves into the unique architectural attributes and runtime characteristics of ViTs, highlighting their computational bottlenecks.
The survey then explores the fundamental principles of model quantization, including linear quantization, symmetric/asymmetric quantization, and static/dynamic quantization. It provides a comparative analysis of state-of-the-art quantization techniques for ViTs, focusing on addressing the challenges associated with quantizing non-linear operations like softmax, layer normalization, and GELU.
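The linear quantization principles mentioned above can be sketched in a few lines. This is a minimal illustration (not code from the survey): symmetric quantization fixes the zero-point at 0 and derives the scale from the maximum absolute value, while asymmetric quantization fits a scale and zero-point to the full observed range, which suits skewed distributions such as post-GELU activations.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    # Symmetric: zero-point fixed at 0; scale from max |x|.
    qmax = 2 ** (num_bits - 1) - 1                # 127 for INT8
    scale = float(np.max(np.abs(x))) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(x, num_bits=8):
    # Asymmetric: scale and zero-point cover the full [min, max] range.
    qmin, qmax = 0, 2 ** num_bits - 1             # uint8 range
    scale = float(x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point=0):
    # Map integers back to real values: x_hat = (q - z) * s.
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-3, 3, 100).astype(np.float32)
q, s = quantize_symmetric(x)
print("max reconstruction error:", np.abs(x - dequantize(q, s)).max())
```

For static quantization these scales are fixed from calibration data ahead of time; dynamic quantization recomputes them per input at runtime.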
The survey then examines hardware acceleration strategies for quantized ViTs, emphasizing hardware-friendly algorithm design. It discusses calibration optimization methods for post-training quantization (PTQ), gradient-based optimization techniques for quantization-aware training (QAT), and specialized strategies for binary quantization of ViTs, which aim to achieve ultra-compact models through efficient bitwise operations.
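To make the bitwise-operation payoff of binary quantization concrete, here is a hedged sketch (XNOR-Net-style scaling, not a method from the survey): weights and activations are binarized to {-1, +1} with a per-tensor scaling factor, after which a dot product reduces to counting bit agreements, which hardware implements with XNOR and popcount instead of multiply-accumulates.

```python
import numpy as np

def binarize(x):
    # Sign binarization with scaling factor alpha = mean(|x|),
    # so alpha * sign(x) approximates x in the L1 sense.
    alpha = float(np.mean(np.abs(x)))
    return np.where(x >= 0, 1, -1).astype(np.int8), alpha

def binary_dot(a_bits, b_bits, alpha_a, alpha_b):
    # For {-1,+1} vectors: a . b = n - 2 * (number of mismatched positions),
    # i.e. an XNOR + popcount on packed bit representations in hardware.
    n = a_bits.size
    mismatches = int(np.count_nonzero(a_bits != b_bits))
    return alpha_a * alpha_b * (n - 2 * mismatches)

rng = np.random.default_rng(0)
a, b = rng.standard_normal(512), rng.standard_normal(512)
(a_bits, sa), (b_bits, sb) = binarize(a), binarize(b)
print("binary approx:", binary_dot(a_bits, b_bits, sa, sb))
print("full precision:", float(a @ b))
```

The approximation error this introduces is exactly why the binary-ViT works surveyed here need specialized training strategies to recover accuracy.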
Alongside the survey, the authors maintain a repository of related open-source materials to facilitate further research and development in this domain.
Key Statistics
"The computational demands of ViTs, in terms of FLOPs and MOPs, increase more than proportionally with the size of the input image."
"Operations with an arithmetic intensity below 200 are recognized as memory-bound, limiting their performance potential on advanced GPUs like the RTX 4090."
"Adopting INT8 precision emerges as a crucial optimization in compute-bound situations, capitalizing on the enhanced efficiency and throughput of quantized computing."
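The statistics above can be checked with back-of-the-envelope roofline arithmetic. The sketch below (my own illustration, using the 200 FLOPs/byte threshold quoted above, an assumed patch size of 16, head dimension 64, and fp16 operands) computes the arithmetic intensity of the Q·Kᵀ attention matmul: token count grows quadratically with image side length, so attention FLOPs grow quartically, yet the operation stays memory-bound.

```python
def matmul_stats(m, k, n, bytes_per_el=2):
    # GEMM C[m,n] = A[m,k] @ B[k,n]:
    # FLOPs = 2*m*k*n (multiply + add); memory ops count reading
    # A and B and writing C once, in bytes (fp16 => 2 bytes/element).
    flops = 2 * m * k * n
    mops = bytes_per_el * (m * k + k * n + m * n)
    return flops, mops, flops / mops

head_dim = 64                                   # assumed ViT head size
for image_side in (224, 448):
    n_tokens = (image_side // 16) ** 2          # patch size 16 assumed
    flops, mops, ai = matmul_stats(n_tokens, head_dim, n_tokens)
    bound = "memory-bound" if ai < 200 else "compute-bound"
    print(f"{image_side}px: {n_tokens} tokens, "
          f"{flops/1e6:.1f} MFLOPs, AI={ai:.1f} FLOPs/byte ({bound})")
```

Doubling the image side quadruples the token count and multiplies attention-map FLOPs by 16, while arithmetic intensity stays well under 200, matching the quoted observation that such operations are memory-bound on GPUs like the RTX 4090.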
Quotes
"The pivotal feature of ViTs, self-attention, allows the model to contextually analyze visual data by learning intricate relationships between elements within a sequence of image tokens."
"This combination of large model sizes and high computational and memory demands significantly hinders deployment on devices with constrained computational and memory resources, particularly in real-time applications such as autonomous driving and virtual reality."
"Quantization, a technique that maps higher precision into lower precision, has been successful in facilitating lightweight and computationally efficient models, enhancing the interaction between algorithms and hardware."