Hierarchical long short-term memory for action recognition based on 3D skeleton joints from Kinect sensor
Abstract
Action recognition has been used in a wide range of applications such as human-computer interaction, intelligent video surveillance systems, video summarization, and robotics. Recognizing actions is important for intelligent agents to understand, learn from, and interact with their environment. Recent technology that allows the acquisition of RGB+D and 3D skeleton data, together with advances in deep learning models, has significantly increased the performance of action recognition models. In this research, a hierarchical Long Short-Term Memory (LSTM) network is proposed to recognize actions based on 3D skeleton joints from a Kinect sensor. The model processes each of the three coordinate axes of the skeleton joints separately and groups the joints of each axis into parts, namely the spine, left and right arms, left and right hands, and left and right legs. To fit the hierarchically structured LSTM layers, the part representations are concatenated into spine, arms, hands, and legs, and these are then concatenated into a body representation. The model merges the per-axis body representations into a single final body representation, which is fed to the final layer to classify the action. Performance is measured using cross-view and cross-subject evaluation on 10 action classes of the NTU RGB+D dataset, achieving accuracies of 0.854 and 0.837, respectively.
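The part-to-body hierarchy described above can be summarized in code. The following is a minimal PyTorch sketch, assuming a 25-joint Kinect v2 skeleton; the joint-index groups in `PARTS`, the hidden size, and the fusion details are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the hierarchical LSTM described in the abstract.
# Joint groupings and layer sizes are illustrative assumptions only.
import torch
import torch.nn as nn

# Hypothetical joint-index groups for a 25-joint Kinect v2 skeleton.
PARTS = {
    "spine":      [0, 1, 2, 3, 20],
    "left_arm":   [4, 5, 6],
    "right_arm":  [8, 9, 10],
    "left_hand":  [7, 21, 22],
    "right_hand": [11, 23, 24],
    "left_leg":   [12, 13, 14, 15],
    "right_leg":  [16, 17, 18, 19],
}

class AxisHierarchy(nn.Module):
    """Part -> limb -> body LSTM hierarchy for a single coordinate axis."""
    def __init__(self, hidden=32):
        super().__init__()
        # One LSTM per body part; input size = number of joints in the part.
        self.part_lstms = nn.ModuleDict({
            name: nn.LSTM(len(idx), hidden, batch_first=True)
            for name, idx in PARTS.items()
        })
        # Second level: arms, hands, and legs each fuse a left/right pair.
        self.limb_lstms = nn.ModuleDict({
            name: nn.LSTM(2 * hidden, hidden, batch_first=True)
            for name in ("arms", "hands", "legs")
        })
        # Third level: spine + arms + hands + legs -> body representation.
        self.body_lstm = nn.LSTM(4 * hidden, hidden, batch_first=True)

    def forward(self, x):  # x: (batch, time, 25) coordinates along one axis
        part = {name: self.part_lstms[name](x[..., idx])[0]
                for name, idx in PARTS.items()}
        limbs = {
            "arms": self.limb_lstms["arms"](
                torch.cat([part["left_arm"], part["right_arm"]], -1))[0],
            "hands": self.limb_lstms["hands"](
                torch.cat([part["left_hand"], part["right_hand"]], -1))[0],
            "legs": self.limb_lstms["legs"](
                torch.cat([part["left_leg"], part["right_leg"]], -1))[0],
        }
        body_in = torch.cat([part["spine"], limbs["arms"],
                             limbs["hands"], limbs["legs"]], -1)
        return self.body_lstm(body_in)[0]  # (batch, time, hidden)

class HierarchicalLSTM(nn.Module):
    """Fuses the three per-axis body streams and classifies the action."""
    def __init__(self, num_classes=10, hidden=32):
        super().__init__()
        self.axes = nn.ModuleList([AxisHierarchy(hidden) for _ in range(3)])
        self.final_lstm = nn.LSTM(3 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (batch, time, 25 joints, 3 axes)
        bodies = [axis(x[..., i]) for i, axis in enumerate(self.axes)]
        out, _ = self.final_lstm(torch.cat(bodies, -1))
        return self.classifier(out[:, -1])  # logits from last time step
```

Given a batch of joint-coordinate sequences, e.g. `torch.randn(8, 60, 25, 3)` (batch, frames, joints, axes), the model returns `(8, 10)` class logits.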