Authors: Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, and Dongsoo Lee (Samsung)
Abstract: The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computation and the required memory footprint increase as well. Quantization is an efficient method to address such concerns. Unfortunately, commercial processors do not fully support quantization, because they allow only fixed-width data transfers (such as 32 bits). The success of quantization in practice therefore relies on an efficient computation engine design, especially for matrix multiplication. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM accesses multiple quantized weights simultaneously in a single instruction. In addition, BiQGEMM pre-computes intermediate results, which are highly redundant because quantization limits the space of values the computations can take. Our extensive experimental results show that BiQGEMM achieves higher performance than conventional schemes when DNNs are quantized.
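To make the pre-computation idea concrete, the following is a minimal Python sketch of a lookup-table-based multiply for binary-coding quantized weights, in the spirit of the abstract; it is not the paper's implementation. The function names (`build_tables`, `lut_matvec`), the sub-vector length `MU`, and the one-scale-per-row quantization format are illustrative assumptions. Because each quantized weight is only +1 or -1, all 2**MU partial sums over a length-MU block of the input can be tabulated once and reused by every output row, so a single table lookup replaces MU multiply-accumulate operations.

```python
import numpy as np

MU = 8  # sub-vector length; each table holds 2**MU partial sums (illustrative choice)

def build_tables(x):
    """For each length-MU block of the input x, precompute every possible
    {+1, -1}-weighted partial sum. Bit b of the table index gives the sign of x[b]."""
    num_blocks = len(x) // MU
    tables = np.zeros((num_blocks, 2 ** MU))
    for blk in range(num_blocks):
        sub = x[blk * MU:(blk + 1) * MU]
        for idx in range(2 ** MU):
            signs = np.array([1.0 if (idx >> b) & 1 else -1.0 for b in range(MU)])
            tables[blk, idx] = signs @ sub
    return tables

def lut_matvec(weight_bits, alpha, x):
    """Multiply a binary-quantized matrix (row m is alpha[m] * B[m], B in {+1, -1})
    by x using only table lookups and additions; one lookup covers MU weights."""
    tables = build_tables(x)          # built once, amortized over every output row
    M, num_blocks = weight_bits.shape
    y = np.zeros(M)
    for m in range(M):
        acc = 0.0
        for blk in range(num_blocks):
            acc += tables[blk, weight_bits[m, blk]]
        y[m] = alpha[m] * acc
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, K = 4, 32
    B = rng.choice([-1.0, 1.0], size=(M, K))   # binary-coded weight matrix
    alpha = rng.random(M)                      # per-row scaling factors
    x = rng.standard_normal(K)
    # Pack each row of B into MU-bit table indices (bit b set means weight +1).
    bits = np.zeros((M, K // MU), dtype=np.int64)
    for m in range(M):
        for blk in range(K // MU):
            for b in range(MU):
                if B[m, blk * MU + b] > 0:
                    bits[m, blk] |= 1 << b
    # Lookup-table result matches the dense computation (alpha[m] * B[m]) @ x.
    assert np.allclose(lut_matvec(bits, alpha, x), (alpha[:, None] * B) @ x)
```

The self-check at the bottom confirms the lookup result matches a dense matrix-vector product; the packed `bits` array also illustrates how many quantized weights can be fetched with one instruction-width read.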