Efficient Human Action Recognition by Restricting the Search Space in Deep Learning Methods
Maryam Koohzadi Hikooei
Nasrollah Moghaddam Charkari
1 - Tarbiat Modares University
2 - Tarbiat Modares University
Keywords: human action recognition, deep learning, spatio-temporal, computational complexity, feature selection mechanism
Abstract:
The performance of human action recognition systems depends on extracting a suitable representation from video data. In recent years, deep learning methods have been proposed to extract efficient spatio-temporal representations from video data; however, extending these methods along the temporal dimension incurs high computational complexity. Moreover, the sparsity and scarcity of discriminative data, together with abundant noise factors, aggravate the computational problems of action representation and limit discriminative power. In this paper, spatial and temporal deep learning networks are improved by adding suitable feature selection mechanisms that counter noise factors and shrink the search space. To this end, offline and online feature selection mechanisms are investigated for human action recognition with lower computational complexity and higher discriminative power. The results show that the offline feature selection mechanism considerably reduces computational complexity, while the online feature selection mechanism increases discriminative power and keeps computational complexity under control.
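The abstract does not say how the offline and online mechanisms are implemented, so the following is only a rough sketch of the general distinction, not the authors' method. It assumes each frame is already a small feature vector, uses a frame-difference score as a stand-in for offline selection (frames are discarded before any network runs), and uses softmax attention over frame features as a stand-in for online selection (weights are computed during the forward pass). All function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # one feature vector per video frame (assumed format)


def motion_energy(prev: Frame, cur: Frame) -> float:
    """Sum of absolute differences between two consecutive frame vectors."""
    return sum(abs(a - b) for a, b in zip(prev, cur))


def offline_select(frames: List[Frame], k: int) -> List[int]:
    """Offline selection (assumed criterion): keep the k frames that differ
    most from their predecessor. Returns kept indices in temporal order."""
    scores = [0.0] + [
        motion_energy(frames[i - 1], frames[i]) for i in range(1, len(frames))
    ]
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def online_attention(features: List[Frame], query: Frame) -> List[float]:
    """Online selection (assumed form): softmax weights over frames, scored
    by the dot product of each frame feature with a query vector."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(features: List[Frame], weights: List[float]) -> List[float]:
    """Attention-weighted average of the frame features."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Toy video: five frames, two features each; only frames 2 and 3 change.
frames = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 3.0], [1.0, 3.0]]
kept = offline_select(frames, k=2)
print(kept)  # [2, 3]

selected = [frames[i] for i in kept]
weights = online_attention(selected, query=[0.0, 1.0])
pooled = weighted_pool(selected, weights)
```

In this sketch the offline step shrinks the input from five frames to two before any further computation, which is where the cost saving would come from, while the online step keeps both remaining frames and only reweights them, which is closer to improving discrimination than to reducing cost.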