Submitted: 25 March 2025
Posted: 25 March 2025
Abstract
Keywords:
1. Introduction
- Fusion of Handcrafted and Deep Learning Features: We combine handcrafted features with deep learning-derived representations to capture both explicit speech characteristics and complex emotional patterns. This fusion enhances the model’s accuracy and robustness, improving generalization across different emotional expressions and speaker variations.
- Handcrafted Features: We extract features such as Zero-Crossing Rate (ZCR), Mel-Frequency Cepstral Coefficients (MFCCs), spectral contrast, and Mel-spectrogram, which focus on key speech characteristics like pitch, tone, and energy fluctuations. These features provide valuable insights into emotional content, enhancing the model’s ability to distinguish subtle emotional variations.
- Multi-Stream Deep Learning Architecture: Our model employs three streams: a 1D CNN, a 1D CNN with Long Short-Term Memory (LSTM), and a 1D CNN with Bidirectional LSTM (Bi-LSTM). Together, these streams capture both local and global patterns in speech, providing a robust understanding of emotional nuances; the LSTM and Bi-LSTM streams in particular improve recognition over speech sequences. A minimal sketch of the three streams appears after this list.
- Ensemble Learning with Soft Voting: We combine predictions from the three streams using an ensemble learning technique with soft voting, improving emotion classification by leveraging the strengths of each model.
- Improved Performance and Generalization: Data augmentation techniques such as noise addition, pitch modification, and time stretching enhance the model’s robustness and generalization, addressing challenges like speaker dependency and variability in emotional expressions. Our approach achieves accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% on the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively, outperforming the traditional single-model baselines.
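To make the multi-stream design concrete, the following is a minimal sketch of the three streams in Keras/TensorFlow. The layer counts, filter sizes, and the assumed input length (`NUM_FEATURES = 162`, a stacked handcrafted feature vector) are illustrative placeholders, not the paper’s exact configuration.

```python
from tensorflow.keras import layers, models

NUM_FEATURES = 162   # assumed length of the stacked handcrafted feature vector
NUM_CLASSES = 7      # e.g., the seven SUBESCO emotion classes

def conv_block(x, filters):
    # Conv1D -> BatchNorm -> MaxPool -> Dropout, a common 1D-CNN building block
    x = layers.Conv1D(filters, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    return layers.Dropout(0.2)(x)

def build_stream(kind):
    inp = layers.Input(shape=(NUM_FEATURES, 1))
    x = conv_block(inp, 64)
    x = conv_block(x, 128)
    if kind == "cnn":
        x = layers.Flatten()(x)                       # local patterns only
    elif kind == "cnn_lstm":
        x = layers.LSTM(64)(x)                        # adds forward temporal context
    else:  # "cnn_bilstm"
        x = layers.Bidirectional(layers.LSTM(64))(x)  # context in both directions
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inp, out, name=kind)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

streams = [build_stream(k) for k in ("cnn", "cnn_lstm", "cnn_bilstm")]
```

Each stream is trained independently on the same feature vectors; their softmax outputs are later combined by soft voting.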
2. Related Works
3. Datasets
3.1. SUBESCO Dataset
3.2. BanglaSER Dataset
3.3. SUBESCO and BanglaSER Merged Dataset
3.4. RAVDESS Dataset
3.5. EMODB Dataset
4. Materials and Methods
4.0.1. Data Augmentation
4.1. Feature Extraction
4.2. Deep Learning Model
4.2.1. 1D-CNN Approach
4.2.2. Integration of 1D-CNN and LSTM Approach
4.2.3. Integration of 1D-CNN and Bi-LSTM Approach
4.3. Soft Voting Ensemble Learning
5. Results
5.1. Ablation Study
5.2. Outcomes of the Models for SUBESCO Dataset
5.3. Outcomes of the Models for BanglaSER Dataset
5.4. Outcomes of the Models for SUBESCO and BanglaSER Merged Dataset
5.5. Outcomes of the Models for RAVDESS and EMODB Datasets
5.6. State-of-the-Art Comparison
5.7. Discussion
6. Conclusion
Abbreviations
| Abbreviation | Definition |
|---|---|
| PBCC | Phase-Based Cepstral Coefficients |
| DTW | Dynamic Time Warping |
| RMS | Root Mean Square |
| MFCCs | Mel-Frequency Cepstral Coefficients |
References
- Bashari Rad, B.; Moradhaseli, M. Speech emotion recognition methods: A literature review. AIP Conference Proceedings 2017, 1891, 020105.
- Hashem, A.; Arif, M.; Alghamdi, M. Speech emotion recognition approaches: A systematic review. Speech Communication 2023, 154, 102974.
- Muntaqim, M.Z.; Smrity, T.A.; Miah, A.S.M.; Kafi, H.M.; Tamanna, T.; Farid, F.A.; Rahim, M.A.; Karim, H.A.; Mansor, S. Eye Disease Detection Enhancement Using a Multi-Stage Deep Learning Approach. IEEE Access.
- Hossain, M.M.; Chowdhury, Z.R.; Akib, S.M.R.H.; Ahmed, M.S.; Hossain, M.M.; Miah, A.S.M. Crime Text Classification and Drug Modeling from Bengali News Articles: A Transformer Network-Based Deep Learning Approach. In Proceedings of the 2023 26th International Conference on Computer and Information Technology (ICCIT); IEEE, 2023; pp. 1–6.
- Rahim, M.A.; Farid, F.A.; Miah, A.S.M.; Puza, A.K.; Alam, M.N.; Hossain, M.N.; Karim, H.A. An Enhanced Hybrid Model Based on CNN and BiLSTM for Identifying Individuals via Handwriting Analysis. CMES-Computer Modeling in Engineering and Sciences 2024, 140.
- Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech emotion recognition using deep learning techniques: A review. IEEE Access 2019, 7, 117327–117345.
- Saad, F.; Mahmud, H.; Shaheen, M.; Hasan, M.K.; Farastu, P. Is Speech Emotion Recognition Language-Independent? Analysis of English and Bangla Languages using Language-Independent Vocal Features. Computing Research Repository (CoRR) 2021.
- Chakraborty, C.; Dash, T.K.; Panda, G.; Solanki, S.S. Phase-based Cepstral Features for Automatic Speech Emotion Recognition of Low-Resource Indian Languages. ACM Transactions on Asian and Low-Resource Language Information Processing 2022.
- Ma, E. Data Augmentation for Audio. Medium, 2019.
- Rintala, J. Speech Emotion Recognition from Raw Audio using Deep Learning. School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 2020.
- Tusher, M.M.R.; et al. BanTrafficNet: Bangladeshi Traffic Sign Recognition Using a Lightweight Deep Learning Approach. Computer Vision and Pattern Recognition.
- Siddiqua, A.; Hasan, R.; Rahman, A.; Miah, A.S.M. Computer-Aided Osteoporosis Diagnosis Using Transfer Learning with Enhanced Features from Stacked Deep Learning Modules. arXiv preprint arXiv:2412.09330, 2024.
- Tusher, M.M.R.; Farid, F.A.; et al. Development of a Lightweight Model for Handwritten Dataset Recognition: Bangladeshi City Names in Bangla Script. Computers, Materials & Continua 2024, 80, 2633–2656.
- Sultana, S.; Iqbal, M.Z.; Selim, M.R.; Rashid, M.M.; Rahman, M.S. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks. IEEE Access 2021, 10, 564–578.
- Rahman, M.M.; Dipta, D.R.; Hasan, M.M. Dynamic time warping assisted SVM classifier for Bangla speech recognition. In Proceedings of the 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2); 2018; pp. 1–6.
- Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomedical Signal Processing and Control 2020, 59, 101894.
- Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical Signal Processing and Control 2019, 47, 312–323.
- Mustaqeem; Kwon, S. 1D-CNN: Speech emotion recognition system using a stacked network with dilated CNN features. CMC-Computers, Materials & Continua 2021, 67, 4039–4059.
- Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon); IEEE, 2017; pp. 1–5.
- Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM architecture for speech emotion recognition with data augmentation. arXiv preprint arXiv:1802.05630, 2018.
- Ai, X.; Sheng, V.S.; Fang, W.; Ling, C.X.; Li, C. Ensemble learning with attention-integrated convolutional recurrent neural network for imbalanced speech emotion recognition. IEEE Access 2020, 8, 199909–199919.
- Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2019, 20, 183.
- Zheng, W.; Yu, J.; Zou, Y. An experimental study of speech emotion recognition based on deep convolutional neural networks. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII); IEEE, 2015; pp. 827–831.
- Sultana, S.; Rahman, M.S.; Selim, M.R.; Iqbal, M.Z. SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. PLOS ONE 2021, 16, e0250173.
- Das, R.K.; Islam, N.; Ahmed, M.R.; Islam, S.; Shatabda, S.; Islam, A.M. BanglaSER: A speech emotion recognition dataset for the Bangla language. Data in Brief 2022, 42, 108091.
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE 2018, 13, e0196391.
- Agnihotri, P. Berlin Database of Emotional Speech (EmoDB) Dataset. Kaggle, 2020. Accessed: December 2022.
- Ippolito, P.P. Data Augmentation Guide [2023 edition], 2023.
- McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; Battenberg, E.; Nieto, O.; Dieleman, S.; Tokunaga, H.; McQuin, P.; et al. librosa/librosa: 0.10.1. Zenodo, 2023. https://zenodo.org/records/8252662.
- McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference; 2015; pp. 18–25.
- Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362.
- Jordal, I. Polarity inversion, 2018.
- Jordal, I. Gain, 2018.
- Titeux, N. Everything You Need to Know About Pitch Shifting, 2023.
- Kedem, B. Spectral analysis and discrimination by zero-crossings. Proceedings of the IEEE 1986, 74, 1477–1493.
- Bourtsoulatze, A. Audio Signal Feature Extraction for Analysis. Medium, 2020.
- Shah, A.; Kattel, M.; Nepal, A.; Shrestha, D. Chroma Feature Extraction using Fourier Transform, 2019.
- Zaheer, N. Audio Signal Processing: How Machines Understand Audio Signals, 2023.
- Behera, K. Feature extraction from audio, 2020.
- Daehnhardt, E. Audio Signal Processing with Python’s librosa, 2023.
- West, K.; Cox, S. Finding an Optimal Segmentation for Audio Genre Classification. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR); 2005; pp. 680–685.
- Peeters, G. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. Technical report, IRCAM, 2004.
- Fabien, M. Sound Feature Extraction, 2020.
- Mahanta, S.K.; Arvindpdmn. Audio Feature Extraction. Devopedia, 2021.
- Wu, J. Introduction to convolutional neural networks. National Key Lab for Novel Software Technology, Nanjing University, China 2017, 5, 495.
- Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data 2021, 8.
- Zafar, A.; Aamir, M.; Mohd Nawi, N.; Arshad, A.; Riaz, S.; Alruban, A.; Dutta, A.; Alaybani, S. A Comparison of Pooling Methods for Convolutional Neural Networks. Applied Sciences 2022, 12, 8643.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 2014, 15, 1929–1958.
- GeeksforGeeks. What is a Neural Network Flatten Layer?, 2024.
- Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation Functions: Comparison of Trends in Practice and Research for Deep Learning. arXiv preprint arXiv:1811.03378, 2018.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2017.
- Mohammed, A.; Kora, R. A Comprehensive Review on Ensemble Deep Learning: Opportunities and Challenges. Journal of King Saud University - Computer and Information Sciences 2023, 35.

| Dataset Name | Total Samples | Train/Test Ratio | Train Samples | Test Samples |
|---|---|---|---|---|
| SUBESCO | 7000 | 70/30 | 4900 | 2100 |
| BanglaSER | 1467 | 80/20 | 1173 | 294 |
| SUBESCO and BanglaSER merged | 8467 | 70/30 | 5926 | 2541 |
| RAVDESS (Audio-only) | 1440 | 70/30 | 1008 | 432 |
| EMODB | 535 | 80/20 | 428 | 107 |
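A minimal sketch of how the splits above could be produced with scikit-learn’s `train_test_split`. The stratified split, the `random_state`, and the placeholder `X`/`y` arrays are assumptions for illustration, not necessarily the authors’ exact procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_dataset(X, y, test_ratio):
    # stratify=y keeps the per-emotion class balance in both partitions
    return train_test_split(X, y, test_size=test_ratio,
                            stratify=y, random_state=42)

# placeholder data with SUBESCO's size: 7000 clips, 7 emotion classes
X = np.random.rand(7000, 162)
y = np.random.randint(0, 7, size=7000)
X_tr, X_te, y_tr, y_te = split_dataset(X, y, test_ratio=0.30)
print(len(X_tr), len(X_te))   # 4900 2100, matching the table's 70/30 row
```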
| Augmentation Name | Description |
|---|---|
| Polarity Inversion | Reverses the phase of the audio signal by multiplying it by -1, effectively canceling the phase when combined with the original signal, resulting in silence [32]. |
| Noise Addition | Adds random white noise to the audio data to enhance its variability and robustness [9]. |
| Time Stretching | Alters the speed of the audio by stretching or compressing time series data, increasing or decreasing sound speed [9]. |
| Pitch Change | Changes the pitch of the audio signal by adjusting the frequency of sound components, typically by resampling [34]. |
| Sound Shifting | Randomly shifts the audio by a predefined number of seconds, introducing silence at the shifted location if necessary [9]. |
| Random Gain | Alters the loudness of the audio signal using a volume factor, making it louder or softer [33]. |
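The following is a minimal sketch of these augmentations using librosa and NumPy. The parameter values (noise level, stretch rate, pitch steps, gain bounds, shift length) are illustrative assumptions rather than the paper’s exact settings.

```python
import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    return y + noise_level * np.random.randn(len(y))   # white-noise addition

def stretch(y, rate=0.9):
    return librosa.effects.time_stretch(y, rate=rate)  # slower when rate < 1

def shift_pitch(y, sr, n_steps=2):
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def invert_polarity(y):
    return -y                                          # multiply the waveform by -1

def random_gain(y, low=0.5, high=1.5):
    return y * np.random.uniform(low, high)            # louder or softer

def shift_sound(y, sr, max_seconds=0.5):
    shift = np.random.randint(1, int(sr * max_seconds))
    shifted = np.roll(y, shift)
    shifted[:shift] = 0.0                              # silence in the shifted-in region
    return shifted

# apply to any mono clip; librosa's bundled example is used here for convenience
y, sr = librosa.load(librosa.example("trumpet"))
augmented = [add_noise(y), stretch(y), shift_pitch(y, sr),
             invert_polarity(y), random_gain(y), shift_sound(y, sr)]
```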
| Feature Name | Description and Advantage |
|---|---|
| Zero-Crossing Rate (ZCR) | Counts the number of times the audio signal crosses the horizontal axis. It helps analyze signal smoothness and is effective for distinguishing voiced from unvoiced speech [35,36]. |
| Chromagram | Represents energy distribution over frequency bands corresponding to pitch classes in music. It captures harmonic and melodic features of the signal, useful for tonal analysis [37,38]. |
| Spectral Centroid | Indicates the "center of mass" of a sound’s frequencies, providing insight into the brightness of the sound. It is useful for identifying timbral characteristics [39]. |
| Spectral Roll-off | Measures the frequency below which a certain percentage of the spectral energy is contained. This feature helps in distinguishing harmonic from non-harmonic content [39]. |
| Spectral Contrast | Measures the difference in energy between peaks and valleys in the spectrum, capturing timbral texture and distinguishing between different sound sources [40,41]. |
| Spectral Flatness | Quantifies how noise-like a sound is. A high spectral flatness value indicates noise-like sounds, while a low value indicates tonal sounds, useful for identifying the type of sound [42]. |
| Mel-Frequency Cepstral Coefficients (MFCCs) | Capture spectral variations in speech, focusing on the components most relevant to human hearing. They are widely used in speech recognition and enhance emotion recognition capabilities [40,42]. |
| Root Mean Square (RMS) Energy | Measures the loudness of the audio signal, offering insights into the energy of the sound, which is crucial for understanding the emotional intensity [43]. |
| Mel-Spectrogram | Converts the frequencies of a spectrogram to the mel scale, representing the energy distribution in a perceptually relevant way, commonly used in speech and audio processing [44]. |
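A minimal sketch of extracting these features with librosa, pooling each feature to a clip-level vector by averaging over time. The time-averaging and `n_mfcc=20` are common choices assumed here, not necessarily the authors’ exact pooling.

```python
import numpy as np
import librosa

def extract_features(y, sr):
    feats = [
        librosa.feature.zero_crossing_rate(y),          # ZCR
        librosa.feature.chroma_stft(y=y, sr=sr),        # chromagram
        librosa.feature.spectral_centroid(y=y, sr=sr),  # brightness
        librosa.feature.spectral_rolloff(y=y, sr=sr),   # energy roll-off point
        librosa.feature.spectral_contrast(y=y, sr=sr),  # peak/valley contrast
        librosa.feature.spectral_flatness(y=y),         # noise-likeness
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),    # MFCCs
        librosa.feature.rms(y=y),                       # RMS energy (loudness)
        librosa.feature.melspectrogram(y=y, sr=sr),     # Mel-spectrogram
    ]
    # mean over the time axis, then stack into a single 1D feature vector
    return np.concatenate([f.mean(axis=1) for f in feats])

y, sr = librosa.load(librosa.example("trumpet"))
vector = extract_features(y, sr)
print(vector.shape)   # one fixed-length vector regardless of clip duration
```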
| Dataset | 1D CNN | 1D CNN LSTM | 1D CNN BiLSTM | Ensemble Learning |
|---|---|---|---|---|
| SUBESCO | 90.93% | 90.98% | 90.50% | 92.90% |
| BanglaSER | 83.67% | 84.52% | 81.97% | 85.20% |
| SUBESCO + BanglaSER | 88.92% | 88.61% | 87.56% | 90.63% |
| RAVDESS | 65.63% | 64.93% | 60.76% | 67.71% |
| EMODB | 69.57% | 67.39% | 65.84% | 69.25% |
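Soft voting itself reduces to averaging the per-class probabilities of the three streams. A minimal sketch follows, assuming each trained stream exposes a Keras-style `predict()` returning softmax probabilities and that the streams are weighted equally.

```python
import numpy as np

def soft_vote(models, X):
    # average the per-class probability distributions across models,
    # then pick the class with the highest mean probability
    probs = np.mean([m.predict(X, verbose=0) for m in models], axis=0)
    return np.argmax(probs, axis=1)

# e.g., with the three streams sketched in the introduction:
# y_pred = soft_vote(streams, X_test)
```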
| Model | Accuracy |
|---|---|
| 1D CNN | 90.93% |
| 1D CNN LSTM | 90.98% |
| 1D CNN BiLSTM | 90.50% |
| Ensemble Learning | 92.90% |
| Emotion | Accuracy (%) |
|---|---|
| Angry | 93.93 |
| Disgust | 84.98 |
| Fear | 93.61 |
| Happy | 94.98 |
| Neutral | 98.69 |
| Sad | 92.32 |
| Surprise | 92.05 |
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Angry | 0.94 | 0.94 | 0.94 |
| Disgust | 0.92 | 0.85 | 0.88 |
| Fear | 0.93 | 0.93 | 0.93 |
| Happy | 0.91 | 0.95 | 0.93 |
| Neutral | 0.94 | 0.99 | 0.96 |
| Sad | 0.95 | 0.92 | 0.94 |
| Surprise | 0.91 | 0.92 | 0.92 |
| Macro Average | 0.93 | 0.93 | 0.93 |
| Weighted Average | 0.93 | 0.93 | 0.93 |
Overall accuracy: 0.93.
| Model | Accuracy |
|---|---|
| 1D CNN | 83.67% |
| 1D CNN LSTM | 84.52% |
| 1D CNN BiLSTM | 81.97% |
| Ensemble Learning | 85.20% |
| Emotion | Accuracy (%) |
|---|---|
| Angry | 93.07 |
| Happy | 76.61 |
| Neutral | 93.02 |
| Sad | 83.59 |
| Surprise | 81.67 |
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Angry | 0.91 | 0.93 | 0.92 |
| Happy | 0.83 | 0.77 | 0.80 |
| Neutral | 0.89 | 0.93 | 0.91 |
| Sad | 0.85 | 0.84 | 0.84 |
| Surprise | 0.78 | 0.82 | 0.80 |
| Macro Average | 0.85 | 0.85 | 0.85 |
| Weighted Average | 0.85 | 0.85 | 0.85 |
Overall accuracy: 0.85.
| Model | Accuracy |
|---|---|
| 1D CNN | 88.92% |
| 1D CNN LSTM | 88.61% |
| 1D CNN BiLSTM | 87.56% |
| Ensemble Learning | 90.63% |
| Emotion | Accuracy (%) |
|---|---|
| Angry | 94.84 |
| Disgust | 86.19 |
| Fear | 92.69 |
| Happy | 89.03 |
| Neutral | 96.74 |
| Sad | 87.53 |
| Surprise | 87.15 |
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Angry | 0.93 | 0.95 | 0.94 |
| Disgust | 0.91 | 0.86 | 0.88 |
| Fear | 0.92 | 0.93 | 0.92 |
| Happy | 0.89 | 0.89 | 0.89 |
| Neutral | 0.90 | 0.97 | 0.93 |
| Sad | 0.89 | 0.88 | 0.88 |
| Surprise | 0.90 | 0.87 | 0.89 |
| Macro Average | 0.91 | 0.91 | 0.91 |
| Weighted Average | 0.91 | 0.91 | 0.91 |
Overall accuracy: 0.91.
| Model | Accuracy |
|---|---|
| 1D CNN | 65.63% |
| 1D CNN LSTM | 64.93% |
| 1D CNN BiLSTM | 60.76% |
| Ensemble Learning | 67.71% |
| Model | Accuracy |
|---|---|
| 1D CNN | 69.57% |
| 1D CNN LSTM | 67.39% |
| 1D CNN BiLSTM | 65.84% |
| Ensemble Learning | 69.25% |
| Research | Features Used | Model | Reported Accuracy (Dataset) |
|---|---|---|---|
| Sultana et al. [14] | Mel-spectrogram | Deep CNN and BLSTM | 86.9% (SUBESCO); 82.7% (RAVDESS) |
| Rahman et al. (2018) [15] | MFCCs, MFCC derivatives | SVM with RBF kernel, DTW | 86.08% (Bangla, 12 speakers) |
| Chakraborty et al. (2022) [8] | PBCC | Gradient Boosting Machine | 96%; 96% (low-resource Indian languages) |
| Issa et al. [16] | MFCCs, Mel-spectrogram, chromagram, spectral contrast, Tonnetz representation | 1D CNN | 64.3% (IEMOCAP, 4 classes); 71.61% (RAVDESS, 8 classes); 95.71% (EMO-DB) |
| Zhao et al. (2019) [17] | Log-Mel spectrogram | 1D CNN LSTM, 2D CNN LSTM | 89.16% (IEMOCAP, speaker-dependent); 52.14% (IEMOCAP, speaker-independent); 95.33% (EMO-DB, speaker-dependent); 95.89% (EMO-DB, speaker-independent) |
| Mustaqeem and Kwon [18] | Spectral analysis | 1D dilated CNN with BiGRU | 72.75% (IEMOCAP); 78.01% (RAVDESS); 91.14% (EMO-DB) |
| Badshah et al. (2017) [19] | Spectrograms | CNN (3 convolutional, 3 FC layers) | 56% (EMO-DB) |
| Etienne et al. (2018) [20] | High-level features, log-Mel spectrogram | CNN-LSTM (4 convolutional + 1 BLSTM layers) | 61.7% unweighted, 64.5% weighted (IEMOCAP) |
| Proposed | ZCR, chromagram, RMS, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-spectrogram, MFCCs | Ensemble of 1D CNN, 1D CNN LSTM, and 1D CNN BiLSTM | 92.90% (SUBESCO); 85.20% (BanglaSER); 90.63% (SUBESCO + BanglaSER); 67.71% (RAVDESS); 69.25% (EMO-DB) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
