Body Field: Structured Mean Field with Human Body Skeleton Model and Shifted Gaussian Edge Potentials
Subject areas: Image Processing
Sara Ershadi-Nasab 1, Shohreh Kasaei 2, Esmaeil Sanaei 3, Erfan Noury 4, Hassan Hafez-kolahi 5
1 - Sharif University
2 - Sharif University
3 - Sharif University
4 - Sharif University
5 - Sharif University
Keywords: Human Body Parts, Skeleton Model, Mean Field Approximation, Pose Estimation, Segmentation, Shifted Gaussian Kernel
Abstract:
An efficient method for simultaneous human body part segmentation and pose estimation is introduced. The method uses a fully connected conditional random field in which each node is an image pixel and the possible node labels comprise the human body parts and the background. The spatial dependencies among body parts, as given by a human body skeleton model, are encoded in the pairwise energy functions of the conditional random field. Pairwise edge potentials between image pixels are defined according to the presence or absence of body parts that are near each other. The pairwise energy terms are built from Gaussian kernels in the position, color, and histogram of oriented gradients spaces. In addition, a shifted Gaussian kernel is defined between each pair of body parts that are connected in the skeleton model. Because shifted Gaussian kernels impose a high computational cost on inference, an efficient inference process is proposed, based on a mean field approximation that uses high-dimensional shifted Gaussian filtering. Experimental results on the challenging KTH Football, Leeds Sports Pose, HumanEva, and Penn-Fudan datasets show that the proposed method increases the per-pixel accuracy of human body part segmentation and also improves the probability of correct parts (PCP) metric for human body joint locations.
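The abstract does not give the exact energy function, so the following is only a minimal sketch of the general idea, under stated assumptions: a shifted Gaussian kernel over pixel positions rewards a pair of pixels when their displacement matches an expected offset between two connected body parts, and a naive mean field loop uses that kernel to update per-pixel label distributions. The function names, the `edges` dictionary, the offsets, and the weights are all hypothetical. The sketch uses position features only and a brute-force O(N^2) kernel matrix, not the efficient high-dimensional filtering that the paper proposes.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def shifted_gaussian(pos, shift, sigma):
    """k[i, j] = exp(-||p_i - p_j - shift||^2 / (2 * sigma^2))."""
    diff = pos[:, None, :] - pos[None, :, :] - shift
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))


def mean_field(unary, pos, edges, sigma=1.0, weight=1.0, iters=5):
    """Naive mean field updates (assumed form, not the paper's exact model).

    unary: (N, L) array of unary energies (lower is better).
    pos:   (N, 2) array of pixel positions.
    edges: {(a, b): shift} where shift is the expected value of
           p_i - p_j when pixel i has label a and pixel j has label b.
    """
    q = softmax(-unary)
    kernels = {}
    for (a, b), shift in edges.items():
        k = shifted_gaussian(pos, np.asarray(shift, dtype=float), sigma)
        np.fill_diagonal(k, 0.0)  # a pixel sends no message to itself
        kernels[(a, b)] = k
    for _ in range(iters):
        support = np.zeros_like(unary, dtype=float)
        for (a, b), k in kernels.items():
            support[:, a] += k @ q[:, b]    # label-b evidence supports label a
            support[:, b] += k.T @ q[:, a]  # and the reverse direction
        q = softmax(-unary + weight * support)
    return q


# Toy example: three pixels on a vertical line; labels are
# 0 = background, 1 = "head", 2 = "torso" (hypothetical part names).
pos = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
unary = np.array([
    [2.0, 0.0, 2.0],  # pixel 0: clearly head
    [0.0, 2.0, 2.0],  # pixel 1: clearly background
    [1.0, 2.0, 1.0],  # pixel 2: tie between background and torso
])
edges = {(1, 2): (0.0, -2.0)}  # head expected 2 pixels above torso
q = mean_field(unary, pos, edges)
print(q.argmax(axis=1))
```

In this toy case the unary term alone cannot decide pixel 2, but the confident head pixel two rows above it contributes support for the torso label through the shifted kernel, so the final labeling is head, background, torso. An efficient version would replace the explicit kernel matrix with fast high-dimensional filtering, as the abstract describes.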