Improvement in Accuracy and Speed of Image Semantic Segmentation via Convolution Neural Network Encoder-Decoder
Authors: Hanieh Zamanian 1, Hassan Farsi 2, Sajad Mohammadzadeh 3
1 - University of Birjand
2 - University of Birjand
3 - University of Birjand
Keywords: semantic segmentation, convolutional neural networks, encoder-decoder, pixel-wise semantic interpretation
Abstract:
Recent research on pixel-wise semantic segmentation uses deep neural networks to improve accuracy and speed, with the aim of increasing efficiency in practical applications such as autonomous driving. These approaches use deep architectures to predict pixel labels, but the results remain unsatisfactory. The main reason is the presence of max-pooling operators, which reduce the resolution of the feature maps. In this paper, we present a convolutional neural network composed of encoder and decoder sections, based on the successful SegNet network. The encoder has a depth of 2; its first part has 5 convolutional layers, each with 64 filters of size 3×3. In the decoder, the dimensions of the decoding filters are matched to the convolutions used at the corresponding encoding step. Thus, at each step, 64 filters of size 3×3 are used for decoding, and their weights are learned during network training and adapted to the training data. Because of its low depth of 2 and small number of parameters, the proposed network improves on both the speed and the accuracy of popular networks such as SegNet and DeepLab. On the CamVid dataset, after a total of 60,000 iterations, we obtain a global accuracy of 91%, which indicates the improved efficiency of the proposed method.
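The two quantitative points in the abstract, the small parameter count of a stack of 64-filter 3×3 layers and the loss of resolution caused by max pooling, can be illustrated with a rough back-of-the-envelope sketch. It is not the authors' implementation. The helper names `conv_params` and `pooled_size` are hypothetical, and the RGB input, the bias terms, the 360×480 input size, and the reading of "depth of 2" as two 2×2 pooling stages are all assumptions made only for illustration.

```python
# Illustrative arithmetic only; helper names and all sizes below are assumptions,
# not values confirmed by the paper.

def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Number of learnable parameters in one 2-D convolutional layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def pooled_size(size, window=2, stride=2):
    """Spatial size after max pooling with no padding."""
    return (size - window) // stride + 1

# Assumption: a 3-channel (RGB) input followed by five 3x3 layers of 64 filters.
channels = [3, 64, 64, 64, 64, 64]
per_layer = [conv_params(c_in, c_out) for c_in, c_out in zip(channels, channels[1:])]
total = sum(per_layer)

# Assumption: a 360x480 input and two 2x2 pooling stages.
height, width = 360, 480
for _ in range(2):
    height, width = pooled_size(height), pooled_size(width)

print("parameters per layer:", per_layer)
print("total parameters:", total)
print("feature map after two poolings:", (height, width))
```

Under these assumptions each pooling stage halves both spatial dimensions, which is the loss of resolution that the decoder has to recover, while the convolutional stack itself stays small.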