Speech Emotion Recognition Using Scalogram Based Deep Structure

Document Type: Original Article


Department of Engineering and Technology, University of Mazandaran, Babolsar, Iran


Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on extracting hand-crafted features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking style, and environment. Here, an SER method is proposed that concatenates a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). CNNs can learn local salient features from speech signals, images, and videos, while RNNs have been used in many sequential data processing tasks to learn long-term dependencies between local features. Combining the two exploits the strengths of both networks. In the proposed method, the CNN is applied directly to a scalogram of the speech signal. An attention-mechanism-based RNN model is then used to learn long-term temporal relationships among the learned features. Experiments on several datasets, including RAVDESS, SAVEE, and Emo-DB, demonstrate the effectiveness of the proposed SER method.
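As a rough illustration of the two signal-level ideas named above, the following dependency-free Python sketch computes a scalogram (squared magnitude of a continuous wavelet transform) and applies a softmax attention pooling over time frames. It is not the authors' implementation: the Morlet wavelet, the parameter `w0`, the truncation of the wavelet support, the chosen scales, and the scoring vector `score_vec` are all assumptions made for the example, and the CNN and RNN stages are omitted. In the method described above, attention would act on RNN outputs; here it is applied directly to scalogram columns as a stand-in.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet wavelet at time t (w0 is an assumed centre frequency)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def scalogram(signal, scales, w0=6.0):
    """Return |CWT|^2 as a list of rows: one row per scale, one column per sample."""
    n = len(signal)
    rows = []
    for s in scales:
        # Assumption: truncate the wavelet support to +/- 4 scale widths.
        half = int(math.ceil(4 * s))
        offsets = range(-half, half + 1)
        # Conjugated, scaled wavelet sampled at integer offsets.
        kernel = [morlet(k / s, w0).conjugate() / math.sqrt(s) for k in offsets]
        row = []
        for b in range(n):
            acc = 0j
            for coeff, k in zip(kernel, offsets):
                idx = b + k
                if 0 <= idx < n:  # samples outside the signal count as zero
                    acc += signal[idx] * coeff
            row.append(abs(acc) ** 2)
        rows.append(row)
    return rows


def attention_pool(frames, score_vec):
    """Softmax-weighted average of per-frame feature vectors.

    score_vec is a hypothetical scoring vector; in a trained model it
    would be learned rather than fixed by hand.
    """
    scores = [sum(w * x for w, x in zip(score_vec, f)) for f in frames]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    context = [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]
    return weights, context


# Toy input: a pure tone at 0.1 cycles per sample.
tone = [math.sin(2 * math.pi * 0.1 * n) for n in range(200)]
# Scales whose Morlet centre frequencies are 0.2, 0.1 and 0.05 cycles per sample.
demo_scales = [6.0 / (2 * math.pi * f) for f in (0.2, 0.1, 0.05)]
demo_scalogram = scalogram(tone, demo_scales)
# Treat each time sample's column of energies as one frame.
demo_frames = [[row[t] for row in demo_scalogram] for t in range(len(tone))]
demo_weights, demo_context = attention_pool(demo_frames, [0.0, 1.0, 0.0])
```

For the toy tone, the energy concentrates in the row whose scale matches the tone frequency, which is the time-frequency localisation a CNN would then operate on. A practical pipeline would more likely use an optimised wavelet library and a deep learning framework than this plain-Python loop.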

