Human Action Recognition in Still Images Using Human Pose and a Multi-Stream Neural Network
Subject Areas: Electrical and Computer Engineering
Roghayeh Yousefi and K. Faez
Keywords: deep neural network, human action recognition, pose estimation, three-stream network
Abstract:
Human action recognition in still images has become an active topic in computer vision and pattern recognition. The task is to identify a human action or behavior from a single static image. Unlike traditional methods that rely on videos or image sequences, still images carry no temporal information, which makes still-image action recognition more challenging than video-based recognition. Given the importance of motion information for action recognition, the Im2Flow method is used to estimate motion information from a static image. The proposed structure of this paper, a three-stream network, combines three deep neural networks: the first, second, and third networks are trained on the raw color image, the optical flow predicted from the image, and the human pose extracted from the image, respectively. In other words, in addition to the predicted spatial and temporal information, this study also exploits human-pose information, given its importance for recognition performance. The results show that the proposed three-stream neural network improves the accuracy of human action recognition: the proposed method reaches 91.8%, 91.02%, and 96.97% accuracy on the Willow 7-Actions, PASCAL VOC 2012, and Stanford 10 datasets, respectively, indicating promising performance compared to the state of the art.
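To make the three-stream idea concrete, below is a minimal PyTorch sketch of the architecture described above. It is an illustration under stated assumptions, not the authors' implementation: the ResNet-18 backbones, the 3-channel encodings assumed for the predicted flow and the rendered pose map, and the score-averaging fusion are all choices made here for brevity; the paper's exact backbones and fusion scheme may differ.

```python
# A minimal sketch of a three-stream network of the kind described in the
# abstract. This is an illustration, not the authors' implementation: the
# ResNet-18 backbones, the 3-channel encodings assumed for the predicted
# flow and the rendered pose map, and score-averaging fusion are assumptions.
import torch
import torch.nn as nn
from torchvision import models


def make_stream(num_classes: int) -> nn.Module:
    # One independent CNN per stream; weights are not shared across streams.
    net = models.resnet18(weights=None)  # backbone choice is an assumption
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net


class ThreeStreamNet(nn.Module):
    """Late fusion of three streams: the raw RGB image, the optical flow
    predicted from it (e.g., by Im2Flow, encoded as a 3-channel image),
    and a human-pose map rendered as a 3-channel image."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.rgb_stream = make_stream(num_classes)
        self.flow_stream = make_stream(num_classes)
        self.pose_stream = make_stream(num_classes)

    def forward(self, rgb, flow, pose):
        # Score-level (late) fusion by averaging per-stream class scores;
        # weighted sums or feature concatenation are common alternatives.
        return (self.rgb_stream(rgb)
                + self.flow_stream(flow)
                + self.pose_stream(pose)) / 3.0


if __name__ == "__main__":
    model = ThreeStreamNet(num_classes=7)  # e.g., the 7 Willow action classes
    x = torch.randn(2, 3, 224, 224)        # dummy batch, one tensor per stream
    print(model(x, x, x).shape)            # -> torch.Size([2, 7])
```

Each stream can be trained separately on its own modality before the fused scores are evaluated jointly; the averaging above is simply the most common late-fusion rule, and a learned weighting of the streams is a natural variant.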
[1] G. Guo and A. Lai, "A survey on still image based human action recognition," Pattern Recognition, vol. 47, no. 10, pp. 3343-3361, 2014.
[2] Z. Zhao, H. Ma, and S. You, "Single image action recognition using semantic body part actions," in Proc. IEEE Int. Conf. on Computer Vision, ICCV’17, pp. 3391-3399, Venice, Italy, 22-29 Oct. 2017.
[3] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. Advances in Neural Information Processing Systems, NIPS’14, 9 pp., Montreal, Canada, 8-13 Dec. 2014.
[4] http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
[5] https://www.di.ens.fr/willow/research/stillactions
[6] http://vision.stanford.edu/Datasets/40actions.html
[7] L. Zhang, L. Changxi, P. Peipei, X. Xuezhi, and S. Jingkuan, "Towards optimal VLAD for human action recognition from still images," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP’16, pp. 53-63, Shanghai, China, 20-25 Mar. 2016.
[8] Y. Tani and K. Hotta, "Robust human detection to pose and occlusion using bag-of-words," in Proc. Int. Conf. on Pattern Recognition, ICPR’14, pp. 4376-4381, Stockholm, Sweden, 24-28 Aug. 2014.
[9] F. S. Khan, et al., "Coloring action recognition in still images," International J. of Computer Vision, vol. 105, no. 3, pp. 205-221, Dec. 2013.
[10] G. Gkioxari, R. Girshick, and J. Malik, "Actions and attributes from wholes and parts," in Proc. IEEE Int. Conf. on Computer Vision, ICCV’15, pp. 2470-2478, Santiago, Chile, 7-13 Dec. 2015.
[11] B. Yao, X. Jiang, and A. Khosla, "Human action recognition by learning bases of action attributes and parts," in Proc. IEEE Int. Conf. on Computer Vision, ICCV’11, pp. 1331-1338, Barcelona, Spain, 6-13 Nov. 2011.
[12] V. Delaitre, J. Sivic, and I. Laptev, "Learning person-object interactions for action recognition in still images," in Proc. Advances in Neural Information Processing Systems, NIPS’11, pp. 1503-1511, Granada, Spain, 12-17 Dec. 2011.
[13] B. Yao and L. Fei-Fei, "Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp.1691-1703, Sept. 2012.
[14] A. Prest, C. Schmid, and V. Ferrari, "Weakly supervised learning of interactions between humans and objects," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 601-614, Mar. 2012.
[15] Z. Liang, X. Wang, R. Huang, and L. Lin, "An expressive deep model for human action parsing from a single image," in Proc. IEEE Int. Conf. on Multimedia and Expo, ICME’14, 6 pp., Chengdu, China, 14-18 Jul. 2014.
[16] G. Sharma, F. Jurie, and C. Schmid, "Expanded parts model for semantic description of humans in still images," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 87-101, Jan. 2017.
[17] Z. Zhao, H. Ma, and S. You, "Single image action recognition using semantic body part actions," in Proc. IEEE Int. Conf. on Computer Vision, ICCV’17, pp. 3391-3399, Venice, Italy, 22-29 Oct. 2017.
[18] W. Yang, Y. Wang, and G. Mori, "Recognizing human actions from still images with latent poses," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR’10, pp. 2030-2037, San Francisco, CA, USA, 13-18 Jun. 2010.
[19] Y. Zheng, Y. J. Zhang, X. Li, and B. D. Liu, "Action recognition in still images using a combination of human pose and context information," in Proc. 19th IEEE Int. Conf. on Image Processing, ICIP’12, pp. 785-788, Orlando, FL, USA, 30 Sept.-3 Oct. 2012.
[20] G. Sharma, F. Jurie, and C. Schmid, "Expanded parts model for human attribute and action recognition in still images," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR’13, pp. 652-659, Portland, OR, USA, 23-28 Jun. 2013.
[21] B. C. Ko, J. H. Hong, and J. Y. Nam, "Human action recognition in still images using action poselets and a two-layer classification model," J. of Visual Languages & Computing, vol. 28, no. 1, pp. 163-175, Jun. 2015.
[22] Y. Zhang, L. Cheng, J. Wu, and J. Cai, "Action recognition in still images with minimum annotation efforts," IEEE Trans. on Image Processing, vol. 25, no. 11, pp. 5479-5490, Nov. 2016.
[23] G. Gkioxari, R. Girshick, and J. Malik, "Contextual action recognition with R*CNN," in Proc. IEEE Int. Conf. on Computer Vision, ICCV’15, pp. 1080-1088, Santiago, Chile, 7-13 Dec. 2015.
[24] M. Safaei and H. Foroosh, "Single image action recognition by predicting space-time saliency," arXiv:1705.04641v1, 12 May 2017.
[25] M. Safaei, P. Balouchian, and H. Foroosh, "TICNN: a hierarchical deep learning framework for still image action recognition using temporal image prediction," in Proc. 25th IEEE Int. Conf. on Image Processing, ICIP’18, pp. 3463-3467, Athens, Greece, 7-10 Oct. 2018.
[26] M. Safaei and H. Foroosh, "A zero-shot architecture for action recognition in still images," in Proc. 25th IEEE Int. Conf. on Image Processing, ICIP’18, pp. 460-464, Athens, Greece, 7-10 Oct. 2018.
[27] R. Gao, B. Xiong, and K. Grauman, "Im2Flow: motion hallucination from static images for action recognition," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR’18, pp. 5937-5947, Salt Lake City, UT, USA, 18-22 Jun. 2018.
[28] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: a deep convolutional encoder-decoder architecture for image segmentation," arXiv preprint arXiv:1511.00561, 2015.
[29] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1933-1941, Las Vegas, NV, USA, 27-30 Jun. 2016.