Multimodal Spatiotemporal Feature Map for Dynamic Gesture Recognition from Real-Time Video Sequences

Document Type: Original Article

Authors

Department of Electronics and Communication Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur, Andhra Pradesh, India

Abstract

The utilization of artificial intelligence and computer vision has been extensively explored in the context of human activity and behavior recognition. Numerous researchers have investigated and proposed techniques for human action recognition (HAR) to accurately identify actions from real-time videos. Among these techniques, convolutional neural networks (CNNs) have emerged as the most effective and widely used for activity recognition. This work focuses primarily on the significance of spatial information in activity/action classification. To identify human actions and behaviors from large video datasets, this paper proposes a two-stream spatial CNN approach. One stream is fed with the spatial information from unprocessed RGB frames. The second stream is driven by saliency maps generated by the Graph-Based Visual Saliency (GBVS) method. The outputs of the two spatial streams are combined using sum, max, average, and product feature fusion techniques. The proposed method is evaluated on well-known benchmark human action datasets, namely KTH, UCF101, HMDB51, NTU RGB+D, and G3D, to assess its performance. Promising recognition rates were observed on all datasets.
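The four fusion strategies named in the abstract can be sketched as elementwise operations on the two streams' outputs. The following is a minimal illustration, not the authors' implementation: it assumes both streams emit feature maps of identical shape, and the function name `fuse_features` is hypothetical.

```python
import numpy as np

def fuse_features(rgb_feat, saliency_feat, mode="sum"):
    """Combine equally shaped feature maps from the RGB stream and the
    GBVS saliency stream using one of the four fusion rules from the paper.

    rgb_feat, saliency_feat : np.ndarray of identical shape
    mode : one of "sum", "max", "average", "product"
    """
    if rgb_feat.shape != saliency_feat.shape:
        raise ValueError("both streams must produce equally shaped features")
    if mode == "sum":
        return rgb_feat + saliency_feat              # elementwise addition
    if mode == "max":
        return np.maximum(rgb_feat, saliency_feat)   # elementwise maximum
    if mode == "average":
        return (rgb_feat + saliency_feat) / 2.0      # elementwise mean
    if mode == "product":
        return rgb_feat * saliency_feat              # elementwise product
    raise ValueError(f"unknown fusion mode: {mode}")

# Hypothetical usage: fuse two 7x7x512 conv feature maps, one per stream.
rgb = np.random.rand(7, 7, 512)
sal = np.random.rand(7, 7, 512)
fused = fuse_features(rgb, sal, mode="max")
```

In practice the fused map would then be flattened and passed to the classifier head; which of the four rules works best is an empirical question the paper answers per dataset.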

