A Hybrid Deep Learning Architecture Using 3D CNNs and GRUs for Human Action Recognition

Document Type: Original Article

Authors

Human-Computer Interaction Lab, Department of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran

Abstract

Video content varies along both the temporal and spatial dimensions, and recognizing human actions requires modeling changes in both. To this end, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their combinations have been used to capture video dynamics. However, a hybrid architecture usually yields a more complex model and hence a larger number of parameters to optimize. In this study, we propose a stack of gated recurrent unit (GRU) layers on top of a two-stream inflated 3D convolutional neural network, in which raw frames and optical flow of the video are processed in the first and second streams, respectively. We first divide the video into segments so that its content can be tracked in more detail, and use 3D CNNs to extract spatio-temporal features from each segment, which we call local features. We then feed the sequence of local features to the GRU network and aggregate the outputs of the two processing streams with a weighted averaging operator to obtain global features. The evaluations confirm acceptable results on the HMDB51 and UCF101 datasets. The proposed method improves classification accuracy on the challenging HMDB51 dataset by 1.6% over the best previously reported results.
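To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the idea, not the authors' implementation: a small stand-in 3D CNN replaces the inflated backbone, and the module names (Small3DBackbone, StreamGRU, TwoStreamGRUNet), layer sizes, segment counts, and the learnable fusion weight alpha are illustrative assumptions.

# Minimal sketch of a two-stream 3D-CNN + GRU action recognizer.
# The backbone below is a small stand-in for an inflated 3D CNN; all
# dimensions and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class Small3DBackbone(nn.Module):
    """Extracts a spatio-temporal ("local") feature vector from one video segment."""

    def __init__(self, in_channels: int, feat_dim: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),           # global pooling over T, H, W
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):                      # x: (B, C, T, H, W)
        f = self.features(x).flatten(1)        # (B, 64)
        return self.proj(f)                    # (B, feat_dim)


class StreamGRU(nn.Module):
    """Per-segment 3D CNN followed by a stacked GRU over the segment sequence."""

    def __init__(self, in_channels: int, feat_dim: int = 256,
                 hidden: int = 256, num_layers: int = 2):
        super().__init__()
        self.backbone = Small3DBackbone(in_channels, feat_dim)
        self.gru = nn.GRU(feat_dim, hidden, num_layers=num_layers, batch_first=True)

    def forward(self, segments):               # segments: (B, S, C, T, H, W)
        b, s = segments.shape[:2]
        local = self.backbone(segments.flatten(0, 1)).view(b, s, -1)  # local features
        out, _ = self.gru(local)                # (B, S, hidden)
        return out[:, -1]                       # stream's "global" feature


class TwoStreamGRUNet(nn.Module):
    """RGB and optical-flow streams fused by a learnable weighted average."""

    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        self.rgb_stream = StreamGRU(in_channels=3, hidden=hidden)
        self.flow_stream = StreamGRU(in_channels=2, hidden=hidden)  # (dx, dy) flow
        self.alpha = nn.Parameter(torch.tensor(0.5))   # fusion weight, learned
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb_segments, flow_segments):
        g_rgb = self.rgb_stream(rgb_segments)
        g_flow = self.flow_stream(flow_segments)
        fused = self.alpha * g_rgb + (1.0 - self.alpha) * g_flow
        return self.classifier(fused)


if __name__ == "__main__":
    # 2 videos, 4 segments of 8 frames at 112x112: RGB has 3 channels, flow has 2.
    rgb = torch.randn(2, 4, 3, 8, 112, 112)
    flow = torch.randn(2, 4, 2, 8, 112, 112)
    model = TwoStreamGRUNet(num_classes=51)     # e.g. the 51 HMDB51 classes
    print(model(rgb, flow).shape)               # torch.Size([2, 51])

In this sketch the last GRU state serves as each stream's global feature and a single learned scalar weights the two streams; the paper's model may aggregate and weight the stream outputs differently.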

Keywords


  