Emotion recognition of human speech using deep learning method and MFCC features

Sumon Kumar Hazra, Romana Rahman Ema, Syed Md. Galib, Shalauddin Kabir, Nasim Adnan

Abstract


Subject matter: Speech emotion recognition (SER) is an active and interesting research topic whose purpose is to enable interaction between humans and computers through speech and emotion. To recognize speech emotions, this paper uses five deep learning models: a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), an Artificial Neural Network (ANN), a Multi-Layer Perceptron (MLP), and a merged CNN-LSTM network. The Toronto Emotional Speech Set (TESS), Surrey Audio-Visual Expressed Emotion (SAVEE), and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets were used, merged in three ways: TESS+SAVEE, TESS+RAVDESS, and TESS+SAVEE+RAVDESS. These datasets contain numerous English-language audio recordings spoken by both male and female speakers. This paper classifies seven emotions (sadness, happiness, anger, fear, disgust, neutral, and surprise), which is challenging to do for both male and female data: most previous work has used male-only or female-only speech, and combined male-female datasets have yielded low accuracy in emotion detection tasks. To train a deep learning model on audio data, features must first be extracted; Mel Frequency Cepstral Coefficients (MFCCs) capture the features needed for speech emotion classification. After training the five models on the three merged datasets, the best accuracy of 84.35 % is achieved by the CNN-LSTM on the TESS+SAVEE dataset.
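
To make the pipeline in the abstract concrete, the following Python sketch extracts MFCC features with librosa and trains a small merged CNN-LSTM classifier with Keras. It is a minimal illustration, not the authors' implementation: the data/<emotion>/*.wav layout, the 40-coefficient time-averaged MFCC vector, and all layer sizes are assumptions for demonstration.

```python
# Hypothetical sketch of the MFCC + CNN-LSTM pipeline described in the abstract.
# Assumptions (not from the paper): audio files live under data/<emotion>/*.wav,
# 40 MFCCs are averaged over time per clip, and layer sizes are illustrative.
import glob
import os

import numpy as np
import librosa
from tensorflow import keras

EMOTIONS = ["sadness", "happiness", "anger", "fear", "disgust", "neutral", "surprise"]

def extract_mfcc(path, n_mfcc=40):
    """Load a clip and return its time-averaged MFCC vector."""
    signal, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # shape: (n_mfcc,)

# Build the feature matrix X and one-hot label matrix y.
X, y = [], []
for label, emotion in enumerate(EMOTIONS):
    for path in glob.glob(os.path.join("data", emotion, "*.wav")):
        X.append(extract_mfcc(path))
        y.append(label)
X = np.array(X)[..., np.newaxis]  # (samples, 40, 1) so Conv1D can slide over coefficients
y = keras.utils.to_categorical(y, num_classes=len(EMOTIONS))

# Merged CNN-LSTM: a 1-D convolution over the coefficient axis, then an LSTM,
# then a dense softmax over the seven emotion classes.
model = keras.Sequential([
    keras.layers.Conv1D(64, kernel_size=5, activation="relu", input_shape=(40, 1)),
    keras.layers.MaxPooling1D(pool_size=2),
    keras.layers.LSTM(64),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(len(EMOTIONS), activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2)
```

In practice the corpora would first be combined as described in the abstract (e.g., TESS+SAVEE) under a common seven-emotion label set before splitting into training and validation data.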

Keywords


speech emotion recognition (SER); deep learning method; advanced AI; mel frequency cepstral coefficients (MFCCs); audio data

DOI: https://doi.org/10.32620/reks.2022.4.13
