TY - JOUR
T1 - Toward three-dimensional human action recognition using a convolutional neural network with correctness-vigilant regularizer
AU - Ren, Jun
AU - Reyes, Napoleon
AU - Barczak, Andre
AU - Scogings, Chris
AU - Liu, Mingzhe
N1 - Funding Information:
The first author was supported by the China Scholarship Council. The last author was supported by the Youth Innovation Research Team of Sichuan Province (No. 2015TD0020).
Publisher Copyright:
© 2018 SPIE and IS&T.
PY - 2018/8/7
Y1 - 2018/8/7
N2 - Human action recognition is one of the raisons d'être of human-computer interaction research, as it is vital to meeting the demands of modern society through applications such as automatic video surveillance for security, patient monitoring for recovery, and content-based video retrieval. In line with this, deep learning systems are fast becoming the de facto standard for object recognition, video understanding, and pattern recognition owing to their inherently powerful ability to learn features from vast amounts of data. It therefore makes sense to capitalize on this success and to improve it further for the complex task of action recognition. One contribution of this paper is an effective yet simple method for encoding the spatiotemporal information of skeleton sequences into what we call temporal kinematic images. In the input encoding scheme, we embed various geometric relational features derived from the skeleton sequence in the form of our proposed skeletal optical flows (SOFs). SOFs collectively represent the variations in kinetic energy, angles between limbs, and pairwise displacements between joints over consecutive frames of skeleton data as color variations in the temporal kinematic images. Another contribution is our convolutional neural network with a correctness-vigilant regularizer, which is employed to exploit the discriminative features in the temporal kinematic images for human action recognition. Lastly, we also investigate an adaptive label smoothing technique applied toward the end of the training iterations. Empirical results show that the proposed method is superior to existing works in terms of the generalizability of the resulting model, training convergence speed, and classification accuracy on nine popular benchmark datasets: MHAD, MSR Action 3D, HDM05, and MSR Daily Activity 3D, as well as the latest challenging databases UTKinect-Action, NTU RGB+D, Northwestern-UCLA, UWA3DII, and SBU Kinect Interaction.
AB - Human action recognition is one of the raisons d'être of human-computer interaction research, as it is vital to meeting the demands of modern society through applications such as automatic video surveillance for security, patient monitoring for recovery, and content-based video retrieval. In line with this, deep learning systems are fast becoming the de facto standard for object recognition, video understanding, and pattern recognition owing to their inherently powerful ability to learn features from vast amounts of data. It therefore makes sense to capitalize on this success and to improve it further for the complex task of action recognition. One contribution of this paper is an effective yet simple method for encoding the spatiotemporal information of skeleton sequences into what we call temporal kinematic images. In the input encoding scheme, we embed various geometric relational features derived from the skeleton sequence in the form of our proposed skeletal optical flows (SOFs). SOFs collectively represent the variations in kinetic energy, angles between limbs, and pairwise displacements between joints over consecutive frames of skeleton data as color variations in the temporal kinematic images. Another contribution is our convolutional neural network with a correctness-vigilant regularizer, which is employed to exploit the discriminative features in the temporal kinematic images for human action recognition. Lastly, we also investigate an adaptive label smoothing technique applied toward the end of the training iterations. Empirical results show that the proposed method is superior to existing works in terms of the generalizability of the resulting model, training convergence speed, and classification accuracy on nine popular benchmark datasets: MHAD, MSR Action 3D, HDM05, and MSR Daily Activity 3D, as well as the latest challenging databases UTKinect-Action, NTU RGB+D, Northwestern-UCLA, UWA3DII, and SBU Kinect Interaction.
UR - http://www.scopus.com/inward/record.url?scp=85051826124&partnerID=8YFLogxK
U2 - 10.1117/1.JEI.27.4.043040
DO - 10.1117/1.JEI.27.4.043040
M3 - Article
AN - SCOPUS:85051826124
SN - 1017-9909
VL - 27
JO - Journal of Electronic Imaging
JF - Journal of Electronic Imaging
IS - 4
M1 - 043040
ER -