Indexing metadata

Emotion recognition of human speech using deep learning method and MFCC features


 
Dublin Core PKP Metadata Items Metadata for this Document
 
1. Title Title of document Emotion recognition of human speech using deep learning method and MFCC features
 
2. Creator Author's name, affiliation, country Sumon Kumar Hazra; Jashore University of Science and Technology; Bangladesh
 
2. Creator Author's name, affiliation, country Romana Rahman Ema; Jashore University of Science and Technology; Bangladesh
 
2. Creator Author's name, affiliation, country Syed Md. Galib; Jashore University of Science and Technology; Bangladesh
 
2. Creator Author's name, affiliation, country Shalauddin Kabir; Jashore University of Science and Technology; Bangladesh
 
2. Creator Author's name, affiliation, country Nasim Adnan; Jashore University of Science and Technology; Bangladesh
 
3. Subject Discipline(s)
 
3. Subject Keyword(s) speech emotion recognition (SER); deep learning method; advanced AI; mel frequency cepstral coefficients (MFCCs); audio data
 
3. Subject Subject classification 004.934.032.26
 
4. Description Abstract Subject matter: Speech emotion recognition (SER) is an active research topic. Its purpose is to establish interaction between humans and computers through speech and emotion. To recognize speech emotions, this paper uses five deep learning models: a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, an Artificial Neural Network, a Multi-Layer Perceptron, and a merged CNN and LSTM network (CNN-LSTM). The Toronto Emotional Speech Set (TESS), Surrey Audio-Visual Expressed Emotion (SAVEE), and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets were used for this system. The models were trained on three merged combinations: TESS+SAVEE, TESS+RAVDESS, and TESS+SAVEE+RAVDESS. These datasets contain numerous audio recordings spoken by both male and female speakers of the English language. This paper classifies seven emotions (sadness, happiness, anger, fear, disgust, neutral, and surprise); identifying all seven for both male and female data is a challenge. Most previous work has used male-only or female-only speech, and work on combined male-female datasets has reported low accuracy in emotion detection tasks. To train a deep learning model on audio data, features must first be extracted by a feature extraction technique. Mel Frequency Cepstral Coefficients (MFCCs) are used to extract the features needed for speech emotion classification from the audio data. After training the five models on the three merged datasets, the best accuracy of 84.35 % is achieved by CNN-LSTM with the TESS+SAVEE dataset.
 
5. Publisher Organizing agency, location National Aerospace University "Kharkiv Aviation Institute"
 
6. Contributor Sponsor(s)
 
7. Date (YYYY-MM-DD) 2022-11-29
 
8. Type Status & genre Peer-reviewed Article
 
8. Type Type
 
9. Format File format PDF
 
10. Identifier Uniform Resource Identifier https://nti.khai.edu/ojs/index.php/reks/article/view/reks.2022.4.13
 
10. Identifier Digital Object Identifier (DOI) https://doi.org/10.32620/reks.2022.4.13
 
11. Source Title; vol., no. (year) Radioelectronic and Computer Systems; No 4 (2022): Radioelectronic and Computer Systems
 
12. Language English=en en
 
14. Coverage Geo-spatial location, chronological period, research sample (gender, age, etc.)
 
15. Rights Copyright and permissions Copyright (c) 2022 Sumon Kumar Hazra, Romana Rahman Ema, Syed Md. Galib, Shalauddin Kabir, Nasim Adnan