Enabling rich mixed-precision quantization schemes during the implementation of a CNN can open a previously hidden space of mappings that utilize the hardware resources more effectively than uniformly quantized layers accompanied by standard mappings. CNNs that combine quantized weights and activations with suitable mappings can significantly improve trade-offs among accuracy, energy, and memory requirements compared to less carefully optimized CNN implementations.
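The following is a minimal sketch of what per-layer mixed-precision quantization can look like in practice: each layer's weights are uniformly quantized, but the bit width is chosen per layer rather than globally. The function name, the symmetric per-tensor quantizer, and the example bit-width assignment are illustrative assumptions, not details taken from the work summarized above.

```python
# Minimal sketch of per-layer (mixed-precision) uniform quantization.
# Assumption: a simple symmetric, per-tensor quantizer; real schemes often
# use per-channel scales and learned bit-width assignments.
import numpy as np

def quantize_uniform(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of a tensor to the given bit width."""
    levels = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit signed
    scale = np.max(np.abs(x)) / levels      # per-tensor scale factor
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale                        # dequantized ("fake-quantized") values

# Different layers receive different bit widths, e.g. keeping the first and
# last layers at higher precision than the middle layers (assumed policy).
layer_weights = {
    "conv1": np.random.randn(64, 3, 3, 3),
    "conv2": np.random.randn(128, 64, 3, 3),
    "fc":    np.random.randn(10, 128),
}
bit_widths = {"conv1": 8, "conv2": 4, "fc": 8}

quantized = {name: quantize_uniform(w, bit_widths[name])
             for name, w in layer_weights.items()}
```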
Product quantization (PQ) can eliminate multiply-accumulate operations in deep neural networks (DNNs) by replacing them with memory lookups of pre-computed dot products, offering potential for significant inference acceleration. This work presents the first comprehensive study of PQ for DNN acceleration, including the design of a custom hardware accelerator (PQA) that can achieve up to 3.1x speedup over a highly optimized conventional DNN accelerator.
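To make the lookup-based mechanism concrete, the sketch below shows how product quantization can replace the multiply-accumulates of a dot product with table lookups: the input vector is split into subvectors, each subvector is encoded as its nearest codeword, and the dot product is approximated by summing precomputed codeword-weight partial products. The codebook sizes, variable names, and the use of random (rather than learned) codebooks are illustrative assumptions about the general technique, not a description of the PQA hardware design.

```python
# Minimal sketch of PQ-based dot-product approximation via lookup tables.
# Assumption: codebooks would normally be learned (e.g. with k-means);
# random codebooks are used here only to keep the example self-contained.
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 64, 8, 16          # input dim, number of subspaces, codewords per subspace
d = D // M                   # dimensionality of each subvector

w = rng.standard_normal(D)   # one weight vector (e.g. one output neuron)

# Per-subspace codebooks of shape (M, K, d).
codebooks = rng.standard_normal((M, K, d))

# Offline: precompute the dot product of every codeword with the matching
# weight subvector, giving an M x K lookup table of partial results.
lut = np.einsum('mkd,md->mk', codebooks, w.reshape(M, d))

def pq_dot(x: np.ndarray) -> float:
    """Approximate w.dot(x) using codeword lookups instead of MACs."""
    x_sub = x.reshape(M, d)
    total = 0.0
    for m in range(M):
        # Encode: pick the nearest codeword for this subvector.
        idx = np.argmin(np.linalg.norm(codebooks[m] - x_sub[m], axis=1))
        # Lookup: accumulate the precomputed partial dot product.
        total += lut[m, idx]
    return total

x = rng.standard_normal(D)
print(pq_dot(x), w.dot(x))   # approximate vs. exact dot product
```

At inference time only the encoding (nearest-codeword search) and M table lookups remain per output; the weight-activation multiplications themselves are amortized into the offline table construction, which is what a dedicated accelerator can exploit.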