Multi-Level Ternary Quantization for Improving Sparsity and Computation in Embedded Deep Neural Networks
Subject Areas: AI and Robotics
Hosna Manavi Mofrad 1, Ali Ansarmohammadi 2, Mostafa Salehi 3
1 - Student
2 - Ph.D. Student, Faculty of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
3 - University of Tehran
Keywords: Deep Neural Networks, Multi-Level Ternary Quantization, Sparse Neural Network, Pruning, Embedded Devices
Abstract:
Deep neural networks (DNNs) have attracted great interest due to their success in various applications. However, their computational complexity and memory footprint are the main obstacles to deploying such models on embedded devices with limited memory and computational resources. Network compression techniques can overcome these challenges; among them, quantization and pruning are the most important. A well-known quantization method for DNNs is multi-level binary quantization, which not only exploits simple bit-wise logical operations but also narrows the accuracy gap between binary neural networks and full-precision DNNs. However, since multi-level binary quantization cannot represent the zero value, it does not take advantage of sparsity. On the other hand, it has been shown that DNNs are sparse, and pruning their parameters reduces the amount of data stored in memory while also speeding up computation. In this paper, we propose a pruning- and quantization-aware training method for multi-level ternary quantization that benefits from both multi-level quantization and data sparsity. In addition to improving accuracy over multi-level binary networks, it enables the network to be sparse. To reduce memory size and computational complexity, we increase the sparsity of the quantized network by pruning as long as the accuracy loss remains negligible. The results show that the potential computation speedup of our model can reach 15x at the bit level and 45x at the word level compared to basic multi-level binary networks.
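To make the idea of multi-level ternary quantization concrete, the sketch below approximates a weight tensor as a sum of scaled ternary bases, W ≈ Σ αᵢ·Tᵢ with Tᵢ ∈ {−1, 0, +1}, using greedy residual quantization. This is an illustrative assumption, not the paper's exact pruning- and quantization-aware training procedure; the 0.7·mean(|w|) threshold, the number of levels, and the function names are hypothetical choices borrowed from common ternary-weight practice. The zeros produced at each level are what give the representation its sparsity, which pure multi-level binary bases cannot provide.

```python
import numpy as np

def ternarize(residual, threshold_factor=0.7):
    """One ternary level: map each weight to {-1, 0, +1} with a per-level scale.

    Weights whose magnitude falls below the threshold become zero; these zeros
    are the sparsity that multi-level binary quantization cannot express.
    The 0.7 * mean(|w|) threshold is an illustrative, TWN-style assumption.
    """
    delta = threshold_factor * np.abs(residual).mean()
    t = np.zeros_like(residual)
    t[residual > delta] = 1.0
    t[residual < -delta] = -1.0
    mask = t != 0
    # For a fixed ternary support, the least-squares scale is the mean
    # magnitude of the weights kept at this level.
    alpha = np.abs(residual[mask]).mean() if mask.any() else 0.0
    return alpha, t

def multi_level_ternary(weights, num_levels=2):
    """Greedy residual quantization: W ~= sum_i alpha_i * T_i, T_i in {-1, 0, +1}."""
    residual = weights.astype(np.float64)
    levels = []
    for _ in range(num_levels):
        alpha, t = ternarize(residual)
        levels.append((alpha, t))
        residual = residual - alpha * t  # quantize what previous levels missed
    return levels

if __name__ == "__main__":
    w = np.random.randn(1024)
    levels = multi_level_ternary(w, num_levels=2)
    w_hat = sum(a * t for a, t in levels)
    zero_frac = np.mean([np.mean(t == 0) for _, t in levels])
    print(f"relative reconstruction error: {np.linalg.norm(w - w_hat) / np.linalg.norm(w):.3f}")
    print(f"average zero fraction per ternary level: {zero_frac:.2%}")
```

In a full pipeline, pruning would further enlarge the zero fraction of each ternary base before deployment, and quantization-aware training would learn the weights under this representation rather than applying it post hoc as done here.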