Hierarchical Weighted Framework for Emotional Distress Detection using Personalized Affective Cues
Subject area: Machine learning
1 - Pune University, India
Keywords: Convolutional Neural Network, Long Short-Term Memory, Transformers, Hierarchical Fusion, Distress Detection
Abstract:
Emotional distress detection has become an active research topic in recent years owing to growing concerns about mental health and the complex nature of distress identification. One of the challenging tasks is to use non-invasive technology to understand and detect emotional distress in humans. Personalized affective cues provide a non-invasive approach that considers visual, vocal, and verbal cues to recognize the affective state. In this paper, we propose a multimodal hierarchical weighted framework to recognize emotional distress, using negative emotions to detect the unapparent behavior of the person. To capture facial cues, we employ hybrid models consisting of a transfer-learned residual network and CNN models; the extracted facial cue features are processed and fused at the decision level using a weighted approach. For audio cues, we employ two different models exploiting LSTM and CNN capabilities and fuse their results at the decision level. For textual cues, we use a BERT transformer to learn from the extracted features. We propose a novel decision-level adaptive hierarchical weighted algorithm to fuse the results of the different modalities, and this fused output is used to detect the emotional distress of a person. Hence, we propose a novel algorithm for the detection of emotional distress based on visual, verbal, and vocal cues. Experiments on multiple datasets, including FER2013, JAFFE, CK+, RAVDESS, TESS, ISEAR, the Emotion Stimulus dataset, and the DailyDialog dataset, demonstrate the effectiveness and usability of the proposed architecture. Experiments on the eNTERFACE'05 dataset for distress detection have shown significant results.
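As a rough illustration of the decision-level fusion described above, the following Python sketch combines per-model emotion probability vectors in two stages: first within each modality (two facial models, an LSTM and a CNN audio model, and a single BERT text model), then across the three modalities, before summing the probability mass on negative emotions as a distress indicator. The emotion label set, the fixed weights, and the distress rule are illustrative assumptions; the paper's adaptive weighting scheme is not reproduced here.

# Minimal sketch of decision-level hierarchical weighted fusion, assuming each
# modality head (visual, vocal, verbal) outputs a probability vector over the
# same emotion classes. Weights and the distress rule are illustrative, not the
# paper's exact adaptive algorithm.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "sadness", "happiness", "neutral"]
NEGATIVE = {"anger", "disgust", "fear", "sadness"}  # assumed negative-emotion set

def weighted_fusion(probs, weights):
    """Weighted average of per-model probability vectors (one fusion level)."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize the weights
    return weights @ probs              # shape: (num_classes,)

def hierarchical_fusion(visual_probs, vocal_probs, verbal_probs,
                        intra_weights, modality_weights):
    """Two-level fusion: fuse models within each modality, then across modalities."""
    fused_per_modality = [
        weighted_fusion(visual_probs, intra_weights["visual"]),  # e.g. ResNet + CNN
        weighted_fusion(vocal_probs, intra_weights["vocal"]),    # e.g. LSTM + CNN
        weighted_fusion([verbal_probs], [1.0]),                  # single BERT model
    ]
    return weighted_fusion(fused_per_modality, modality_weights)

def distress_score(fused_probs, emotions=EMOTIONS, negative=NEGATIVE):
    """Sum the probability mass assigned to negative emotions as a distress indicator."""
    return sum(p for e, p in zip(emotions, fused_probs) if e in negative)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v1, v2 = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))  # two facial models
    a1, a2 = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))  # LSTM and CNN audio models
    t = rng.dirichlet(np.ones(6))                                  # BERT text model
    fused = hierarchical_fusion([v1, v2], [a1, a2], t,
                                intra_weights={"visual": [0.6, 0.4], "vocal": [0.5, 0.5]},
                                modality_weights=[0.4, 0.3, 0.3])
    print("fused probabilities:", np.round(fused, 3))
    print("distress score:", round(distress_score(fused), 3))

In this sketch a sample is flagged as distressed when the negative-emotion mass exceeds a chosen threshold; in the paper the weights are adapted rather than fixed, so the values above are placeholders only.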
[1] Gu Simeng, Wang Fushun, Patel Nitesh P., Bourgeois James A., Huang Jason H, “A Model for Basic Emotions Using Observations of Behavior in Drosophila,” Frontiers in Psychology, vol. 10, 2019, pp.781.
[2] Rana R, Latif S, Gururajan R, Gray A, Mackenzie G, Humphris G, Dunn J, “Automated screening for distress: A perspective for the future,” The European Journal of Cancer Care, vol. 28(4), 2019, pp. 1-13.
[3] Riba, M. B. et al., “Distress Management, Version 3.2019,” NCCN Clinical Practice Guidelines in Oncology, Journal of the National Comprehensive Cancer Network, vol. 17(10), 2019, pp. 1229–1249.
[4] A. Mehrabian and S.R. Ferris, “Inference of attitudes from nonverbal communication in two channels,” Journal of Consulting Psychology, vol. 31(3), 1967, pp. 248–252.
[5] T. Thomas, M. Domínguez and R. Ptucha, "Deep independent audio-visual affect analysis," in 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2017, pp. 1417-1421.
[6] J. A. Miranda, M. F. Canabal, J. M. Lanza-Gutiérrez, M. P. García and C. López-Ongil, "Toward Fear Detection using Affect Recognition," in 2019 XXXIV Conference on Design of Circuits and Integrated Systems (DCIS), November 2019, pp. 1-4.
[7] Lang He, Dongmei Jiang, and Hichem Sahli. “Multimodal depression recognition with dynamic visual and audio cues,” Proc. 2015 International Conference on Affective Computing and Intelligent Interaction, IEEE Computer Society, USA, 2015, pp. 260–266.
[8] Guangxia Xu, Weifeng Li, Jun Liu, “A social emotion classification approach using multi-model fusion,” Future Generation Computer Systems, vol. 102, 2020, pp. 347-356.
[9] Zeinab Farhoudi, Saeed Setayeshi, “Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition,” Speech Communication, vol. 127, 2021, pp. 92-103.
[10] Do LN., Yang HJ., Nguyen, HD. et al. “Deep neural network-based fusion model for emotion recognition using visual data,” Journal of Supercomputing, vol.77, 2021, pp. 1-18.
[11] Neha Jain, Shishir Kumar, Amit Kumar, Pourya Shamsolmoali, Masoumeh Zareapoor, “Hybrid deep neural networks for face emotion recognition,” Pattern Recognition Letters, vol. 115, 2018, pp. 101-106.
[12] Heysem Kaya, Furkan Gürpınar, and Albert Ali Salah, “Video-based emotion recognition in the wild using deep transfer learning and score fusion,” Image and Vision Computing, vol. 65, 2017, pp. 66–75.
[13] Shiqing Zhang, Xin Tao, Yuelong Chuang, Xiaoming Zhao, “Learning deep multimodal affective features for spontaneous speech emotion recognition,” Speech Communication, vol. 127, 2021, pp. 73-81.
[14] Jingwei Yan, Wenming Zheng, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong, “Multi-cue fusion for emotion recognition in the wild,” Neurocomputing, vol. 309, 2018, pp. 27-35.
[15] Ilyes Bendjoudi, Frederic Vanderhaegen, Denis Hamad, Fadi Dornaika, “Multi-label, multi-task CNN approach for context-based emotion recognition,” Information Fusion, November 2020, in press.
[16] Man Hao, Wei-Hua Cao, Zhen-Tao Liu, Min Wu, Peng Xiao, “Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features,” Neurocomputing, vol. 391, 2020, pp. 42-51.
[17] Tzirakis, P., Trigeorgis, G., Nicolaou, M.A., Schuller, B., Zafeiriou, S., “End-to-End Multimodal Emotion Recognition Using Deep Neural Networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, 2017, pp. 1301-1309.
[18] N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, S. Poria, “Multimodal sentiment analysis using hierarchical fusion with context modeling,” Knowledge-Based Systems, vol. 161, 2018, pp. 124-133.
[19] Soujanya Poria, Erik Cambria, Amir Hussain, Guang-Bin Huang, “Towards an intelligent framework for multimodal affective data analysis,” Neural Networks, vol. 63, 2015, pp. 104-116.
[20] Yaxiong Ma, Yixue Hao, Min Chen, Jincai Chen, Ping Lu, Andrej Košir, “Audio-visual emotion fusion (AVEF): A deep efficient weighted approach,” Information Fusion, vol. 46, 2019, pp. 184-192.
[21] Jie Guo, Bin Song, Peng Zhang, Mengdi Ma, Wenwen Luo, Junmei lv, “Affective video content analysis based on multimodal data fusion in heterogeneous networks,” Information Fusion, vol. 51, 2019, pp. 224-232.
[22] Soujanya Poria, Erik Cambria, Newton Howard, Guang-Bin Huang, Amir Hussain, “Fusing audio, visual and textual clues for sentiment analysis from multimodal content,” Neurocomputing, vol. 174, Part A, 2016, pp. 50-59.
[23] F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera and G. Anbarjafari, "Audio-Visual Emotion Recognition in Video Clips," IEEE Transactions on Affective Computing, vol. 10, 2019, pp. 60-75.
[24] S. Zhang, S. Zhang, T. Huang, W. Gao and Q. Tian, "Learning Affective Features with a Hybrid Deep Model for Audio–Visual Emotion Recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, 2018, pp. 3030-3043.
[25] Li, R., Liu, Z., “Stress detection using deep neural networks,” BMC Medical Informatics and Decision Making, vol. 20, 2020, pp. 285.
[26] P. Bobade and M. Vani, "Stress Detection with Machine Learning and Deep Learning using Multimodal Physiological Data," in 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), 2020, pp. 51-57.
[27] Zhang, H., Feng, L., Li, N., Jin, Z., & Cao, L., “Video-Based Stress Detection through Deep Learning,” Sensors, vol. 20(19), 2020, pp. 5552.
[28] I. J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” Neural Networks, Special Issue on Deep Learning of Representations, vol. 64, 2015, pp. 59-63.
[29] Lyons, Michael, Kamachi, Miyuki, & Gyoba, Jiro, “The Japanese Female Facial Expression (JAFFE) Dataset,” Zenodo. 1998.
[30] Lucey et al., "The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2010, pp. 94-101.
[31] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[32] Mehmet Berkehan Akçay, Kaya Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Communication, vol. 116, 2020, pp. 56-76.
[33] Zheng, F., Zhang, G. & Song, Z., “Comparison of different implementations of MFCC,” Journal of Computer Science & Technology, vol.16, 2001, pp. 582–589.
[34] Livingstone SR, Russo FA, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLoS ONE, vol. 13(5), 2018.
[35] Pichora-Fuller, M. Kathleen; Dupuis, Kate, "Toronto emotional speech set (TESS)", Scholars Portal Dataverse, V1, 2020.
[36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), June 2019.
[37] I. Mureşan, A. Stan, M. Giurgiu and R. Potolea, "Evaluation of sentiment polarity prediction using a dimensional and a categorical approach," in 7th Conference on Speech Technology and Human - Computer Dialogue (SpeD), 2013, pp. 1-6.
[38] Diman Ghazi, Diana Inkpen & Stan Szpakowicz, “Detecting Emotion Stimuli in Emotion-Bearing Sentences”. in 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015), Cairo, Egypt.
[39] Li Yanran, Su Hui, Shen Xiaoyu, Li Wenjie, Cao Ziqiang, Niu Shuzi, “DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Asian Federation of Natural Language Processing, Taipei, Taiwan, November 2017, pp. 986-995.
[40] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, G. Rigoll, “LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework,” Image and Vision Computing, vol. 31, 2013, pp. 153-163.
[41] S. Chen and Q. Jin. “Multi-Modal Dimensional Emotion Recognition Using Recurrent Neural Networks”, Proc. 5th International Workshop on Audio/Visual Emotion Challenge. AVEC ’15. Brisbane, Australia: Association for Computing Machinery, 2015, pp. 49-56.